Guiding Visual Question Answering with Attention Priors

by Thao Minh Le et al.
Deakin University

The current success of modern visual reasoning systems is arguably attributed to cross-modality attention mechanisms. However, in deliberative reasoning such as in VQA, attention is unconstrained at each step, and thus may serve as a statistical pooling mechanism rather than a semantic operation intended to select information relevant to inference. This is because at training time, attention is only guided by a very sparse signal (i.e. the answer label) at the end of the inference chain. This causes the cross-modality attention weights to deviate from the desired visual-language bindings. To rectify this deviation, we propose to guide the attention mechanism using explicit linguistic-visual grounding. This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects. Here we learn the grounding from the pairing of questions and images alone, without the need for answer annotation or external grounding supervision. This grounding guides the attention mechanism inside VQA models through a duality of mechanisms: pre-training attention weight calculation and directly guiding the weights at inference time on a case-by-case basis. The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process. This scalable enhancement improves the performance of VQA models, fortifies their robustness to limited access to supervised data, and increases interpretability.



1 Introduction

Visual reasoning is the new frontier of AI wherein facts extracted from visual data are gathered and distilled into higher-level knowledge in response to a query. Successful visual reasoning methodology estimates the cross-domain association between symbolic concepts and visual entities in the form of attention weights. Such associations shape the knowledge distillation process, resulting in a unified representation that can be decoded into an answer. In the exemplar reasoning setting known as Visual Question Answering (VQA), attention plays a pivotal role in modern systems (anderson2018bottom; hudson2018compositional; kim2018bilinear; le2020dynamic; lu2016hierarchical). Ideal attention scores must be both relevant and effective: relevance implies that attention is high when the visual entity and the linguistic entity refer to the same concept; effectiveness implies that the attention derived leads to good VQA performance.

However, in typical systems, the attention scores are computed on-the-fly: unregulated at inference time and guided at training time only by the gradient from the groundtruth answers. Analysis of several VQA attention models shows that these attention scores are usually neither relevant nor guaranteed to be effective (das2017human). The problem is even more severe when we cannot afford enough labeled answers due to the cost of the human annotation process. A promising solution is providing pre-computed guidance to direct the attention mechanisms inside VQA models towards more appropriate scores. Early works use human attention as the label for supervising machine attention (qiao2018exploring; selvaraju2019taking). However, this simple and direct attention perceived by humans is not guaranteed to be optimal for machine reasoning (firestone2020performance; fleuret2011comparing). Furthermore, because annotating attention is a complex labeling task, this process is inherently costly, inconsistent and unreliable (selvaraju2019taking). Finally, these methods only regulate the attention scores during training without directly adjusting them at inference. Different from these approaches, we leverage the fact that such external guidance pre-exists in the query-image pairs and can be extracted without any additional labels. Using pre-computed language-visual associations as an inductive bias for attention-based reasoning, without extra labeling, remains a desired but missing capability.

Figure 1: We introduce the Grounding-based Attention Prior (GAP) mechanism (blue box), which considers the linguistic-visual associations between a VQA query-image pair and refines the attention inside the reasoning model (gray box). This boosts the performance of VQA models, reduces their reliance on supervised data and increases their interpretability.

Exploring this underlying linguistic-visual association for VQA, we aim to distill the compatibility between entities across input modalities in an unsupervised manner from query-image pairs, without explicit alignment groundtruths, and to use this knowledge as an inductive bias for the attention mechanism, thus boosting reasoning capability. To this end, we design a framework called Grounding-based Attention Prior (GAP) to (1) extract the alignments between linguistic-visual region pairs and (2) use these pair-wise associations as an inductive bias to guide VQA’s attention mechanisms.

For the first task, we exploit the pairing between the questions and the images as a weakly supervised signal to learn the mapping between words and image regions. Because it exploits the implicit supervision in this pairing, the approach requires no further annotation. To overcome the challenge of disparity in the co-inferred semantics between query words and image regions, we construct a parse tree of the query, extract the nested phrasal expressions and ground them to image regions. These expressions semantically match image regions better than single words do, and thus create a set of more reliable linguistic-visual alignments.

The second task aims at using these newly discovered alignments to guide reasoning attention. This guidance is provided through two complementary pathways. First, we pre-train the attention weights to align with the pre-computed grounding. This step is done in an unsupervised manner without access to the answer groundtruths. Second, we use the attention prior to directly regulate and refine the attention weights that are guided by the groundtruth answer through back-propagation, so that they do not deviate too far from the prior. This is modulated by a learnable gate. These dual guidance pathways are a major advancement over previous attention regularization methods (selvaraju2019taking; wu2019self), as the linguistic-visual compatibility is leveraged directly and flexibly in both training and inference rather than merely as a regularizer.

Through extensive experiments, we show that this methodology is effective both in discovering the grounding and in using it to boost the performance of attention-based VQA models across representative methods and datasets. These improvements surpass the performance of competing methods and, furthermore, require no extra annotation. The proposed method also significantly improves the sample efficiency of VQA models, hence fewer annotated answers are required. Fig. 1 illustrates the intuition and design of the method with an example of the improved attention and answer.

Our key contributions are:

1. A novel framework to calculate linguistic-visual alignments, providing pre-computed attention priors to guide attention-based VQA models;

2. A generic technique to incorporate attention priors into most common visual reasoning methods, fortifying them in performance and significantly reducing their reliance on human supervision; and,

3. Rigorous experiments and analysis on the relevance of linguistic-visual alignments to reasoning attention.

2 Related Work

Attention-based models are the most prominent approaches in VQA. Simple methods (anderson2018bottom) use a single-hop attention mechanism to help the machine select relevant image features. More advanced methods (yang2016stacked; hudson2018compositional; le2020dynamic) and those relying on memory networks (xiong2016dynamic; xu2016ask) use multi-hop attention mechanisms to repeatedly revise the selection of relevant visual information. BAN (kim2018bilinear) learns a co-attention map using expensive bilinear networks to represent the interactions between word-region pairs. One drawback of these attention models is that they are supervised only by the answer groundtruth, without explicit attention supervision.

Attention supervision has recently been studied for several problems such as machine translation (liu2016neural) and image captioning (liu2017attention; ma2020learning; zhou2020more). In VQA, attention can be self-regulated through internal constraints (ramakrishnan2018overcoming; liu2021answer). More successful regularization methods use external knowledge such as human annotations of textual explanations (wu2019self) or visual attention (qiao2018exploring; selvaraju2019taking). Unlike these, we propose to supervise VQA attention using pre-computed language-visual grounding from image-query pairs without external annotation.

Linguistic-visual alignment includes the tasks of text-image matching (lee2018stacked), grounding referring expressions (yu2018mattnet) and cross-domain joint representation (lu2019vilbert; su2020vl). These groundings can support tasks such as captioning (zhou2020more; karpathy2015deep). Although most of these tasks are supervised by human annotations, contrastive learning (gupta2020contrastive; wang2021improving) allows machines to learn the associations between words and image regions from the weak supervision of phrase-image pairs. In this work, we propose to explore such associations between query and image in VQA. This is a new challenge because the query is complex and harder to ground; we therefore devise a new method that exploits grammatical structure.

Our work also shares the knowledge distillation paradigm (hinton2015distilling) with cross-task (albanie2018emotion) and cross-modality (gupta2016cross; liu2018multi; wang2020improving) adaptations. In particular, we distill visual-linguistic grounding and use it as an input for the VQA model’s attention. This also distinguishes our work from recent self-supervised pretraining methods (tan2019lxmert; li2020oscar), which focus on a unified representation for a wide variety of tasks thanks to access to enormous amounts of data. Our work is theoretically applicable to complement the multimodal matching inside these models.

3 Preliminaries

A VQA system aims to deduce an answer $a$ about an image $V$ in response to a linguistic question $q$, i.e., $a = f(q, V)$. The query is typically decomposed into a set of $S$ linguistic entities $\{l_i\}_{i=1}^{S}$. These entities and the query are then embedded into a feature vector space: $l_i, q \in \mathbb{R}^{d}$. In the case of the sequential embedding popularly used for VQA, entities are query words; they are encoded with GloVe for word-level embedding (pennington2014glove) followed by RNNs such as a BiLSTM for sentence-level embedding. Likewise, the image is often segmented into a set of $N$ visual regions with features $\{v_j\}_{j=1}^{N}$, $v_j \in \mathbb{R}^{d}$, by an object detector, e.g., Faster R-CNN (ren2015faster). For ease of reading, we use the same dimension $d$ for both linguistic embedding vectors and visual representation vectors.

A large family of VQA systems (lu2016hierarchical; anderson2018bottom; hudson2018compositional; le2020dynamic; kim2018bilinear; kim2016hadamard) rely on attention mechanisms to distribute conditional computation over the linguistic entities $\{l_i\}$ and their visual counterparts $\{v_j\}$. These models can be broadly classified into two groups: joint- and marginalized-attention models. lu2016hierarchical; anderson2018bottom and hudson2018compositional fall into the former, while kim2018bilinear; kim2016hadamard and Transformer-based models (tan2019lxmert) are typical representatives of the latter category.

Joint attention models

The most complete attention model includes a detailed pair-wise attention map indicating the contextualized correlation between word-region pairs, used to estimate the interaction between visual and linguistic entities for the combined information. These attention weights take the form of a 2D matrix $A \in \mathbb{R}^{S \times N}$, which often contains the fine-grained relationship between each linguistic word and each visual region. The attention matrix is derived by a sub-network as $A = f_{att}(\{l_i\}, \{v_j\}; \theta)$, where each $A_{ij}$ denotes the correlation between the linguistic entity $l_i$ and the visual region $v_j$, and $\theta$ is the network parameters of the VQA model. Joint attention models contain the rich pairwise relations and often perform well. However, calculating and using this full matrix incurs a large computational overhead. A good approximation of this matrix is the pair of vectors marginalized over its rows and columns, described next.

Marginalized attention models

Conceptually, the matrix $A$ is marginalized along its columns into the linguistic attention vector $\alpha \in \mathbb{R}^{S}$ and along its rows into the visual attention vector $\beta \in \mathbb{R}^{N}$. In practice, $\alpha$ and $\beta$ are calculated directly from each pair of input image and query through dedicated attention modules. They can be implemented in different ways, such as direct single-shot attention (anderson2018bottom), co-attention (lu2016hierarchical) or multi-step attention (hudson2018compositional). In our experiments, we concentrate on two popular mechanisms: single-shot attention, where the visual attention is calculated directly from the inputs, and the alternating attention mechanism, where the visual attention follows the linguistic attention (lu2016hierarchical). Concretely, $\alpha$ is estimated first, followed by the attended linguistic feature of the entire query $\bar{q} = \sum_{i} \alpha_i l_i$; this attended linguistic feature is then used to calculate the visual attention $\beta$. The alternating mechanism can be extended with multi-step reasoning (hudson2018compositional; le2020dynamic; hu2019language). In such a case, a pair of attentions $\alpha_t$ and $\beta_t$ is estimated at each reasoning step $t$, forming a series of them.
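As a concrete illustration, the alternating mechanism can be sketched as follows (a minimal numpy sketch with stand-in scoring parameters, not the authors' implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def alternating_attention(L, V, w_l, w_v):
    """Alternating attention: linguistic attention first, then visual.

    L: (S, d) word embeddings; V: (N, d) region features;
    w_l, w_v: (d,) stand-ins for learned scoring sub-networks.
    """
    alpha = softmax(L @ w_l)            # linguistic attention over S words
    q_att = alpha @ L                   # attended query feature, shape (d,)
    beta = softmax(V @ (w_v * q_att))   # visual attention conditioned on q_att
    return alpha, beta

rng = np.random.default_rng(0)
S, N, d = 5, 7, 16
alpha, beta = alternating_attention(rng.normal(size=(S, d)),
                                    rng.normal(size=(N, d)),
                                    rng.normal(size=d),
                                    rng.normal(size=d))
```

Both outputs are probability distributions, over words and regions respectively; the multi-step variant simply repeats this computation with a step-dependent controlling signal.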

Figure 2: Overall architecture of a generic joint attention VQA model using Grounding-based Attention Prior (GAP) to guide the computation of attention weights. Vision-language compatibility pre-computed by an unsupervised framework (green boxes) serves as an extra source of information, providing inductive biases to guide attention weights inside attention-based VQA models towards more meaningful alignment.
Answer decoder

Attention scores drive the reasoning process, producing a joint linguistic-visual representation $z$ on which the answer is decoded: $z = f_{joint}(q, \{v_j\}, \text{att\_scores})$, where “att_scores” refers to either the visual attention vector $\beta$ or the attention matrix $A$. For marginalized attention models, the function $f_{joint}$ is a neural network taking as input the query representation $q$ and the attended visual feature $\bar{v} = \sum_{j} \beta_j v_j$ to return a joint representation. Joint attention models instead use a bilinear combination to calculate each component of the output vector of $f_{joint}$ (kim2018bilinear):

$z_k = \sum_{i=1}^{S} \sum_{j=1}^{N} A_{ij}\,\big(w_k^{\top} l_i\big)\big(u_k^{\top} v_j\big),$

where $k$ is the index of output components and $w_k$ and $u_k$ are learnable weights.
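This bilinear combination can be sketched numerically (shapes and names are illustrative; real systems use low-rank projections and nonlinearities):

```python
import numpy as np

def bilinear_joint(A, L, V, Wl, Wv):
    """z_k = sum_ij A_ij (Wl_k . l_i)(Wv_k . v_j) for k = 1..K.

    A: (S, N) joint attention; L: (S, d) word features;
    V: (N, d) region features; Wl, Wv: (K, d) learnable projections.
    """
    P = L @ Wl.T   # (S, K) projected words
    Q = V @ Wv.T   # (N, K) projected regions
    # contract over words i and regions j for every output component k
    return np.einsum('ij,ik,jk->k', A, P, Q)

rng = np.random.default_rng(1)
S, N, d, K = 4, 6, 8, 3
A = rng.normal(size=(S, N))
L = rng.normal(size=(S, d))
V = rng.normal(size=(N, d))
Wl = rng.normal(size=(K, d))
Wv = rng.normal(size=(K, d))
z = bilinear_joint(A, L, V, Wl, Wv)
```

The einsum contraction is an exact vectorization of the double sum over word-region pairs.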

4 Methods

We now present Grounding-based Attention Priors (GAP), an approach to extract the concept-level association between query and image and use this knowledge as attention priors to guide and refine the cross-modality attentions inside VQA systems. The approach consists of two main stages. First, we learn to estimate the linguistic-visual alignments directly from question-image pairs (Sec. 4.1, green boxes in Fig. 2). Second, we use such knowledge as inductive priors to assist the computation of attention in VQA (Sec. 4.2, Sec. 4.3, and lower parts in Fig. 2).

4.1 Structures for Linguistic-Visual Alignment

Grammatical structures for grounding. The task of linguistic-visual alignment aims to find the groundings between the linguistic entities (i.e., query words in VQA) and the visual entities (i.e., visual regions in VQA) in a shared context. This requires interpreting individual words in the complex context of the query so that they can co-refer to the same concepts as image regions. However, compositional queries have complex structures that prevent state-of-the-art language representation methods from fully understanding the relations between semantic concepts in the queries (reimers2019sentence). We propose to better contextualize query words by breaking a full query into phrases that refer to simpler structures, making the computation of word-region grounding more effective. These phrases are called referring expressions (REs) (mao2016generation) and were shown to co-refer well to image regions (kazemzadeh2014referitgame). The image-query pairing labels of VQA are passed down to the REs of the query. We then ground words, with contextualized embeddings within each RE, to their corresponding visual regions. As the REs are nested phrases from the query, a word can appear in multiple REs. Thus, we obtain the query-wide word-region grounding by aggregating the groundings of the REs containing the word. See Fig. 3 for an example of this process.

Figure 3: The query is parsed into a constituency parse tree to identify REs. Each RE serves as a local context for words. Words within each RE context are grounded to corresponding image regions. A word can appear in multiple REs, and thus its final grounding is averaged over containing REs, serving as inductive prior for VQA.

We extract query REs using a constituency parse tree (cirik2018using) (Berkeley Neural Parser (kitaev2018constituency) in our implementation). In this structure, the query is represented as a set of nested phrases corresponding to subtrees of the parse tree. The parser also provides the grammatical roles of the phrases. For example, the phrase “the white car” will be tagged as a noun phrase while “standing next to the white car” is a verb phrase. As visual objects and regions are naturally associated with noun phrases, we select the set of all noun phrases and wh-noun phrases (noun phrases prefixed by a wh-pronoun, e.g., “which side”, “whose bag”) as the REs.
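RE selection from a bracketed parse can be illustrated with a toy s-expression reader (standing in for the Berkeley Neural Parser, whose bracketed output we assume is available as a string):

```python
def parse(tokens):
    """Recursively read one (LABEL child ...) node from a token stream."""
    label = next(tokens)
    children = []
    while True:
        t = next(tokens)
        if t == ')':
            return (label, children)
        children.append(parse(tokens) if t == '(' else t)

def tree_from_string(s):
    tokens = iter(s.replace('(', ' ( ').replace(')', ' ) ').split())
    assert next(tokens) == '('
    return parse(tokens)

def leaves(node):
    if isinstance(node, str):
        return [node]
    return [w for c in node[1] for w in leaves(c)]

def referring_expressions(node, labels=('NP', 'WHNP')):
    """Collect the phrases of noun-phrase subtrees as candidate REs."""
    if isinstance(node, str):
        return []
    res = [' '.join(leaves(node))] if node[0] in labels else []
    for c in node[1]:
        res += referring_expressions(c)
    return res

tree = tree_from_string(
    "(SBARQ (WHNP (WDT which) (NN side)) (SQ (VBZ is) "
    "(NP (DT the) (JJ white) (NN car)) (PP (IN on))))")
res = referring_expressions(tree)
```

On this toy parse, the selected REs are “which side” (a wh-noun phrase) and “the white car” (a noun phrase), matching the selection rule described above.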

We denote the RE as $r = q_{s:e}$, where $s$ and $e$ are the start and end word indices of the RE within the query $q$. It has length $m = e - s + 1$. We now estimate the correlation between words in these REs and the visual regions by learning a neural association function $\phi(r, V; \theta_{g})$ with parameters $\theta_{g}$ that generates a mapping $G \in \mathbb{R}^{m \times N}$ between the words in the RE and the corresponding visual regions.

We implement $\phi$ as the dot products of the contextualized embedding of each word in $r$ with the representations of the regions in $V$, following the scaled dot-product attention (vaswani2017attention).
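A minimal sketch of this association function (names hypothetical):

```python
import numpy as np

def association_scores(H, V):
    """G_ij = (h_i . v_j) / sqrt(d): scaled dot-product between the
    contextualized embedding h_i of RE word i and region feature v_j."""
    d = H.shape[1]
    return H @ V.T / np.sqrt(d)

rng = np.random.default_rng(2)
H = rng.normal(size=(3, 8))   # a 3-word RE, embedding size 8
V = rng.normal(size=(5, 8))   # 5 candidate regions
G = association_scores(H, V)
```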

Unsupervised training. To train the function $\phi$, we adapt the recent contrastive learning framework (gupta2020contrastive) for phrase grounding to learn these word-region alignments from the RE-image pairs in an unsupervised manner, i.e., without explicit word-region annotations. In a mini-batch of size $B$, we calculate the positive mapping $G^{+}$ on one positive sample (the RE and the visual regions of the image paired with it) and negative mappings $G^{-}_{b}$, where $b = 1, \dots, B-1$, from negative samples (the RE and regions from images not paired with it). We then compute linguistic-induced visual representations $\hat{v}^{+}_{i}$ and $\hat{v}^{-,b}_{i}$ over regions for each word $w_{i}$:

$\hat{v}^{+}_{i} = \sum_{j=1}^{N} \mathrm{norm}\big(G^{+}\big)_{ij}\big(W v^{+}_{j} + c\big), \qquad \hat{v}^{-,b}_{i} = \sum_{j=1}^{N} \mathrm{norm}\big(G^{-}_{b}\big)_{ij}\big(W v^{-,b}_{j} + c\big),$

where “norm” is a column normalization operator; $W$ and $c$ are learnable parameters. We then contrast the positive against the negatives by maximizing the linguistic-vision InfoNCE objective (oord2018representation):

$\mathcal{L}_{NCE} = \sum_{i=1}^{m} \log \frac{\exp\big(l_{i}^{\top} \hat{v}^{+}_{i}\big)}{\exp\big(l_{i}^{\top} \hat{v}^{+}_{i}\big) + \sum_{b=1}^{B-1} \exp\big(l_{i}^{\top} \hat{v}^{-,b}_{i}\big)}.$

This objective maximizes a lower bound of the mutual information between visual regions and contextualized word embeddings (gupta2020contrastive).
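A numpy sketch of the contrastive objective under these definitions, reported as a loss to minimize (the plain dot-product similarity and softmax normalization over regions are our simplifications):

```python
import numpy as np

def norm_over_regions(G):
    """Softmax attention over regions for each RE word."""
    e = np.exp(G - G.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def info_nce_loss(H, V_pos, V_negs):
    """-log p(positive image) per RE word, averaged over the RE.

    H: (m, d) contextualized word embeddings; V_pos: (N, d) regions of
    the paired image; V_negs: list of (N, d) regions of unpaired images.
    """
    def word_scores(Vb):
        G = H @ Vb.T / np.sqrt(H.shape[1])      # association scores
        v_hat = norm_over_regions(G) @ Vb       # word-induced visual reps
        return np.einsum('id,id->i', H, v_hat)  # per-word similarity
    s_pos = word_scores(V_pos)                                    # (m,)
    s_all = np.stack([s_pos] + [word_scores(V) for V in V_negs])  # (B, m)
    m = s_all.max(axis=0)
    log_p = s_pos - m - np.log(np.exp(s_all - m).sum(axis=0))
    return -log_p.mean()

rng = np.random.default_rng(3)
loss = info_nce_loss(rng.normal(size=(4, 8)),
                     rng.normal(size=(6, 8)),
                     [rng.normal(size=(6, 8)) for _ in range(3)])
```

The loss is strictly positive whenever negatives are present, and decreases as the positive image's regions explain the RE's words better than the negatives'.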

Finally, we compute the query-wide word-region alignment $\Gamma \in \mathbb{R}^{S \times N}$ by aggregating the RE-image groundings:

$\Gamma_{ij} = \frac{1}{|\mathcal{R}_{i}|} \sum_{r \in \mathcal{R}_{i}} \big(\bar{G}_{r}\big)_{ij},$

where $\mathcal{R}_{i}$ is the set of REs containing word $w_{i}$, and $\bar{G}_{r} \in \mathbb{R}^{S \times N}$ is the zero-padded matrix of the matrix $G_{r}$, whose rows are placed at the word positions $s, \dots, e$ of $r$ within the query.
Besides making grounding more expressive, this divide-and-conquer strategy has the extra benefit of augmenting the weak supervising labels from query-image to RE-image pairs, which increase the amount of supervising signals (positive pairs) and facilitate better training of the contrastive learning framework.
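The zero-padding and per-word averaging can be sketched as follows (hypothetical helper; each RE carries its span (s, e) and its m×N score matrix):

```python
import numpy as np

def aggregate_groundings(res, S, N):
    """Average RE-level word-region scores into a query-wide (S, N) prior.

    res: list of (s, e, G) tuples with G of shape (e - s + 1, N); each
    word's row is averaged over the REs whose span [s, e] contains it.
    """
    total = np.zeros((S, N))
    count = np.zeros((S, 1))
    for s, e, G in res:
        total[s:e + 1] += G     # zero-padding: place rows at the RE's span
        count[s:e + 1] += 1
    count[count == 0] = 1       # words outside every RE keep zero rows
    return total / count

G1 = np.ones((2, 3))            # RE covering words 0..1
G2 = 3 * np.ones((2, 3))        # RE covering words 1..2
Gamma = aggregate_groundings([(0, 1, G1), (1, 2, G2)], S=4, N=3)
```

In this toy case, word 1 appears in both REs, so its scores are averaged over the two; word 3 appears in none and keeps a zero row.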

The discovered grounding provides a valuable source of priors for VQA attention. Existing works (qiao2018exploring; selvaraju2019taking) use attention priors to regulate the gradient flow of VQA models during training, hence only constraining the attention weights indirectly. Unlike these methods, we directly guide the computation of attention weights via two pathways: by pre-training them without answers, and by refining them during VQA inference on a case-by-case basis.

4.2 Pre-training VQA Attention

A typical VQA system seeks to ground linguistic concepts parsed from the question to the associated visual parts through cross-modal attention. However, this attention mechanism is guided only indirectly and distantly through the sparse training signal of the answers. This training signal is too weak to ensure that relevant associations are discovered. To directly train the attention weights to reflect these natural associations, we pre-train VQA models by enforcing the attention weights to be close to the alignment maps discovered through unsupervised grounding in Sec. 4.1.

For joint attention VQA models, this is achieved through minimizing the Kullback-Leibler divergence between vectorized forms of the VQA attention weights $A$ and the prior grounding scores $\Gamma$:

$\mathcal{L}_{pre} = D_{KL}\big(\mathrm{vec}(\Gamma)\,\big\|\,\mathrm{vec}(A)\big),$

where $\mathrm{vec}(\cdot)$ flattens a matrix into a vector, followed by a normalization operator ensuring the vector sums to one.
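A sketch of this pre-training loss (the ε smoothing is our addition for numerical safety):

```python
import numpy as np

def vec_norm(M, eps=1e-8):
    """Flatten a matrix and normalize it into a probability vector."""
    v = M.reshape(-1) + eps
    return v / v.sum()

def kl_pretrain_loss(Gamma, A):
    """KL(vec(Gamma) || vec(A)) between the grounding prior and the
    model's joint attention map."""
    p, q = vec_norm(Gamma), vec_norm(A)
    return float((p * np.log(p / q)).sum())

rng = np.random.default_rng(4)
Gamma = rng.random((3, 5))
loss_same = kl_pretrain_loss(Gamma, Gamma)       # identical maps -> 0
loss_diff = kl_pretrain_loss(Gamma, rng.random((3, 5)))
```

The loss vanishes exactly when the attention map matches the prior and is positive otherwise, so gradient descent on it pulls the model's attention toward the grounding.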

For marginalized attention VQA models, we first marginalize $\Gamma$ into a vector of visual attention priors $\gamma^{v} \in \mathbb{R}^{N}$:

$\gamma^{v}_{j} = \mathrm{norm}\Big(\sum_{i=1}^{S} \Gamma_{ij}\Big).$

The pre-training loss is the KL divergence between the attention weights and their priors:

$\mathcal{L}_{pre} = D_{KL}\big(\gamma^{v}\,\big\|\,\beta\big).$
4.3 Attention Refinement with Attention Priors

4.3.1 Marginalized attention refinement

Recall from Sec. 3 that a marginalized attention VQA model computes linguistic attention $\alpha$ over query words and visual attention $\beta$ over visual regions. In this section, we propose to directly refine these attentions using the attention priors learned in Sec. 4.1. First, $\Gamma$ is marginalized over rows and columns to obtain a pair of attention prior vectors $\gamma^{l} \in \mathbb{R}^{S}$ and $\gamma^{v} \in \mathbb{R}^{N}$:

$\gamma^{l}_{i} = \mathrm{norm}\Big(\sum_{j=1}^{N} \Gamma_{ij}\Big), \qquad \gamma^{v}_{j} = \mathrm{norm}\Big(\sum_{i=1}^{S} \Gamma_{ij}\Big).$
We then refine $\alpha$ and $\beta$ inside the reasoning process through a gating mechanism to return refined attention weights $\tilde{\alpha}$ and $\tilde{\beta}$ in two forms:

Additive form: $\tilde{\beta} = \mathrm{norm}\big((1 - g^{v})\,\beta + g^{v}\,\gamma^{v}\big)$; (10)
Multiplicative form: $\tilde{\beta} = \mathrm{norm}\big(\beta^{(1 - g^{v})} \odot (\gamma^{v})^{g^{v}}\big)$, (11)

where “norm” is a normalization operator; $g^{l}$ and $g^{v}$ are outputs of learnable gating functions that decide how much the attention priors contribute per word and per region. Intuitively, these gating mechanisms solve the problem of maximizing the agreement between the two sources of information: $\tilde{\beta} = \arg\min_{\hat{\beta}} \big[(1 - g^{v})\,D(\hat{\beta}, \beta) + g^{v}\,D(\hat{\beta}, \gamma^{v})\big]$, where $D$ measures the distance between two probability distributions. When $D$ is the Euclidean distance, it gives Eq. (10); when $D$ is the KL divergence between the two distributions, it gives Eq. (11) (heskes1998selecting) (see the Supplement for detailed proofs). The same intuition applies for the calculation of $\tilde{\alpha}$ with $g^{l}$ and $\gamma^{l}$.

The learnable gates $g^{l}$ and $g^{v}$ for $\tilde{\alpha}$ and $\tilde{\beta}$ are implemented as a neural function of the visual regions $V$ and the question $q$:

$g^{l}, g^{v} = \sigma\big(\mathrm{MLP}([\bar{v};\, q])\big), \quad (12)$

where $\sigma$ is the sigmoid function. For simplicity, $\bar{v}$ is the arithmetic mean of the regions in $V$.
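Both refinement forms and the gate can be sketched as follows (the single-layer gate is a simplification of the MLP):

```python
import numpy as np

def normalize(x, eps=1e-8):
    x = x + eps
    return x / x.sum()

def refine_additive(beta, gamma, g):
    """Eq. (10): convex combination of model attention and prior."""
    return normalize((1 - g) * beta + g * gamma)

def refine_multiplicative(beta, gamma, g):
    """Eq. (11): normalized weighted geometric mean of the two."""
    return normalize(beta ** (1 - g) * gamma ** g)

def gate(V, q, W, b=0.0):
    """Eq. (12) sketch: g = sigmoid(W [mean(V); q] + b)."""
    x = np.concatenate([V.mean(axis=0), q])
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

rng = np.random.default_rng(5)
beta = normalize(rng.random(6))     # model's visual attention
gamma = normalize(rng.random(6))    # grounding-based prior
g = gate(rng.normal(size=(4, 6)), rng.normal(size=6), rng.normal(size=12))
beta_ref = refine_additive(beta, gamma, g)
```

At g = 0 the model's own attention is kept untouched; at g = 1 it is fully replaced by the prior; the learned gate interpolates between these extremes per example.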

For multi-step reasoning, we apply Eqs. (10, 11) step-by-step. Since each reasoning step is driven by an intermediate controlling signal (Sec. 3), we adapt the gating functions to make use of that signal, replacing the whole-question representation $q$ in Eq. (12) with the step-wise controlling signal.

4.3.2 Joint attention refinement

In joint attention VQA models, we can directly use the matrix $\Gamma$ without marginalization. With a slight abuse of notation, we denote the output of the modulating gate for attention refinement as $g$, sharing a similar role with the gating mechanism in Eq. (12):

$\tilde{A} = \mathrm{norm}\big((1 - g)\,A + g\,\Gamma\big) \quad \text{or} \quad \tilde{A} = \mathrm{norm}\big(A^{(1-g)} \odot \Gamma^{g}\big),$

where $g = \sigma\big(\mathrm{MLP}([\bar{v};\, q])\big)$.

4.4 Two-stage Model Training

We perform a two-step pre-training/fine-tuning procedure to train models with the attention priors: (1) unsupervised pre-training of the VQA model without the answer decoder, using the attention priors (Sec. 4.2); and (2) fine-tuning the full VQA model with attention refinement using answer labels, i.e., by minimizing the VQA loss $\mathcal{L}_{VQA}$.
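The two-stage schedule can be illustrated on a toy, single-example problem (a stand-in gradient-descent simulation, not the actual training code): stage 1 pulls the attention toward the grounding prior without answers; stage 2 fine-tunes with an answer-driven loss.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def two_stage_train(prior, target_region, steps=200, lr=0.5):
    """Toy two-stage schedule on a single attention logit vector z.

    Stage 1 (no answers): gradient steps on cross-entropy(prior, softmax(z)),
    pulling attention toward the grounding prior.
    Stage 2 (with answer): gradient steps on -log softmax(z)[target_region],
    a stand-in for the answer-driven VQA loss.
    """
    z = np.zeros_like(prior)
    for _ in range(steps):                 # Stage 1: match the prior
        z -= lr * (softmax(z) - prior)
    for _ in range(steps):                 # Stage 2: answer supervision
        grad = softmax(z)
        grad[target_region] -= 1.0
        z -= lr * grad
    return softmax(z)

prior = np.array([0.7, 0.2, 0.1])                # grounding favors region 0
beta = two_stage_train(prior, target_region=1)   # answer needs region 1
```

Stage 1 initializes the attention at the prior, and stage 2 then moves it toward the answer-relevant region, mirroring how the answer loss can override the prior when the two disagree.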

5 Experiments

Method VQA v2 standard val
All Yes/No Num Other
UpDn+Attn. Align (selvaraju2019taking) 63.2 81.0 42.6 55.2
UpDn+AdvReg (ramakrishnan2018overcoming) 62.7 79.8 42.3 55.2
UpDn+SCR (w. ext.) (wu2019self) 62.2 78.8 41.6 54.5
UpDn+SCR (w/o ext.) (wu2019self) 62.3 77.4 40.9 56.5
UpDn+DLR (jing2020overcoming) 58.0 76.8 39.3 48.5
UpDn+RUBi (cadene2019rubi) 62.7 79.2 42.8 55.5
UpDn+HINT (selvaraju2019taking) 63.4 81.2 43.0 55.5
UpDn+GAP 64.3 81.2 44.1 56.9
Table 1: Performance comparison between GAP and other attention regularization methods using the UpDn baseline on VQA v2. Results of other methods are taken from their respective papers, except for our reproduced results.

We evaluate our approach (GAP) on two representative marginalized attention VQA models: Bottom-Up Top-Down Attention (UpDn) (anderson2018bottom) for single-shot attention and MACNet (hudson2018compositional) for multi-step compositional attention; and a joint attention model, BAN (kim2018bilinear). Experiments are on two datasets: VQA v2 (goyal2017making) and GQA (hudson2019gqa). Unless stated otherwise, we choose the additive gating (Eq. (10)) for experiments with UpDn and MACNet, and the multiplicative form (Eq. (11)) for BAN. Implementation details and extra results are provided in the supplementary materials.

5.1 Experimental Results

Figure 4: GAP’s universality across different baselines and datasets. Figure 5: GAP improves generalization capability with limited access to groundtruth answers.
Enhancing VQA performance

We compare GAP against the VQA models based on the UpDn baseline that utilize external priors and human annotation on VQA v2. Some of these methods use internal regularization: adversarial regularization (AdvReg) (ramakrishnan2018overcoming) and attention alignment (Attn. Align) (selvaraju2019taking); and some use human attention as external supervision: self-critical reasoning (SCR) (wu2019self) and HINT (selvaraju2019taking). These methods mainly aim at designing regularization schemes to exploit the underlying data generation process of the VQA-CP datasets (agrawal2018don), which deliberately build the train and test splits with different answer distributions. This potentially leads to overfitting to the particular test splits, and accuracy gains do not correlate with improvements in actual grounding (shrestha2020negative). On the contrary, GAP does not rely on those regularization schemes but aims at directly improving the learning of attention inside any attention-based VQA model to facilitate the reasoning process. In other words, GAP complements the effects of the aforementioned methods on VQA-CP (see the supplementary materials).

Table 1 shows that our approach (UpDn+GAP) clearly has advantages over other methods in improving the UpDn baseline. The favorable performance is consistent across all question types, especially on “Other” question type, which is the most important and challenging for open-ended answers (teney2021unshuffling; teney2020value).

Compared to methods using external attention annotations (UpDn+SCR, UpDn+HINT), the results suggest that GAP is an effective way to use attention priors (in both learning and inference), especially as our attention priors are extracted in an unsupervised manner without the need for human annotation.

Universality across VQA models

GAP is theoretically applicable to any attention-based VQA model. We evaluate the universality of GAP by trialing it on a wider range of baseline models and datasets. Fig. 4 summarizes the effects of GAP on UpDn, MACNet and BAN on the large-scale datasets VQA v2 and GQA.

It is clear that GAP consistently improves upon all baselines over all datasets. GAP is beneficial not only for the simple model UpDn, but also for the multi-step model MACNet, on which the strongest effect occurs when GAP is applied only in the early reasoning steps, where the attention weights are still far from convergence.

Between datasets, the improvement is stronger on GQA than on VQA v2, which is explained by the fact that GQA has a large portion of compositional questions, from which our unsupervised grounding learning can benefit.

The improvements are less significant with BAN, which already has a large-capacity attention model at the cost of data hunger and computational expense. In the next section, we show that GAP significantly reduces the amount of supervision these models need compared to the baseline.

Model R@1 R@5 R@10 Acc.
Unsupervised RE-image grounding 14.1 35.6 45.5 45.4
Unsupervised grounding w/o REs 12.0 33.0 42.9 44.3
Random alignment score (10 runs) 6.6 28.4 43.3 40.7
Table 2: Grounding performance of the unsupervised RE-image grounding when evaluated on the out-of-distribution image-caption Flickr30K Entities test set. Recall@k: fraction of phrases whose groundtruth bounding boxes have IoU ≥ 0.5 with the top-k predictions.
Sample efficient generalization

We examine the generalization of the baselines and our proposed methods by analyzing sample efficiency with respect to the number of annotated answers required. Fig. 5 shows the performance of the chosen baselines on the validation sets of VQA v2 (left column) and the GQA dataset (right column) when given different fractions of the training data. In particular, when the number of training instances with groundtruth answers is reduced to under 50% of the training set, GAP considerably outperforms all the baseline models in accuracy across all datasets by large margins. For example, when given only 10% of the training data, GAP performs better than BAN, the strongest of the chosen baselines, by over 4.1 points on VQA v2 (54.2% vs. 50.1%) and nearly 4.0 points on GQA (51.7% vs. 47.9%). The benefits of GAP are even more significant for the MACNet baseline, which easily goes off track in its early reasoning steps without large amounts of data. The results strongly demonstrate the benefits of GAP in reducing VQA models’ reliance on supervised data.

5.2 Model Analysis

Performance of unsupervised phrase-image grounding

To analyze the unsupervised grounding aspect of our model (Sec. 4.1), we test the grounding model trained on VQA v2 on a mock test set of caption-image pairs from Flickr30K Entities. This out-of-distribution evaluation setting shows whether our unsupervised grounding framework can learn meaningful linguistic-visual alignments.

The performance of our unsupervised linguistic-visual alignments using the query grammatical structure is shown in the top row of Table 2. This is compared against the alignment scores produced by the same framework but without breaking the query into REs (middle row) and random alignments (bottom row). There is an approximately 5-point accuracy gain over the random scores and over 1 point over grounding question-image pairs without phrases, indicating that our linguistic-visual alignments are a reliable inductive prior for attention in VQA.

Table 3: VQA performance on VQA v2 validation split with different sources of attention priors.
No. Models Acc.
1 UpDn baseline 63.3
2 +GAP w/ uniform-values vector 63.7
3 +GAP w/ random-values vector 63.6
4 +GAP w/ supervised grounding 64.0
5 +GAP w/ unsup. visual grounding 64.3

Table 4: Ablation studies with UpDn on VQA v2.
Models Acc.
1. UpDn baseline (own attention only) 63.3
Attention as priors
2. w/ attention prior only 60.0
Effects of the direct use of attention priors
3. +GAP w/o 1st-stage fine-tuning 63.9
4. w/ 1st-stage fine-tuning with attention priors 64.0
Effects of the gating mechanisms
5. +GAP, fixed gate 64.0
6. +GAP (multiplicative gating) 64.1
Effects of using visual-phrase associations
7. +GAP (w/o extracted phrases from questions) 63.9
8. +GAP (full model) 64.3
Effectiveness of unsupervised linguistic-visual alignments for VQA

We examine the effectiveness of our attention prior by comparing it with different ways of generating values for the visual attention prior on VQA performance. They include: (1) the UpDn baseline (no attention prior); (2) a uniform-values vector; (3) a random-values vector (normalized normal distribution); (4) supervised grounding (MAttNet (yu2018mattnet) pre-trained on RefCOCO (kazemzadeh2014referitgame)); and (5) GAP. Table 3 shows results on the UpDn baseline. GAP is significantly better than the baseline and the other attention priors (Rows 2-4). Notably, our unsupervised grounding gives better VQA performance than the supervised one (Row 5 vs. Row 4). This surprising result suggests that the pre-trained supervised model does not generalize out of distribution and is worse than grounding learned from phrase-image pairs extracted in an unsupervised manner.

Ablation studies

To provide more insight into our method, we conduct extensive ablation studies on the VQA v2 dataset (see Table 4). Throughout these experiments, we examine the role of each component toward the optimal performance of the full model. Experiments (1, 2) in Table 4 show that the UpDn model does not perform well with either only its own attention or only the attention prior. This supports our intuition that the two complement each other toward optimal reasoning. Rows 5 and 6 show that a soft combination of the two terms is necessary.
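As a toy sketch of such a soft combination, the model-induced attention can be blended with the grounding prior via a convex combination controlled by a gate. The gate value, region count, and exact parameterization below are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def refine_attention(model_logits, prior, gate):
    """Blend model-induced attention with a grounding-based prior.

    gate in [0, 1]: 1.0 keeps only the model's own attention,
    0.0 keeps only the prior (a sketch of the additive refinement).
    """
    model_att = softmax(model_logits)
    return gate * model_att + (1.0 - gate) * prior

prior = np.array([0.7, 0.2, 0.1])    # grounding prior over 3 regions
logits = np.array([0.1, 0.1, 2.0])   # model attends to the wrong region
refined = refine_attention(logits, prior, gate=0.3)
```

With a gate of 0.3, the refined attention is pulled back toward the region favored by the grounding prior while remaining a valid distribution.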

Row 7 justifies the use of structured grounding: phrase-image grounding gives better performance than question-image pairs alone. In particular, the extracted RE-image pairs improve performance from 63.9% to 64.3%. This clearly demonstrates the significance of the grammatical structure of questions as an inductive bias for inter-modality matching, which ultimately benefits VQA.

Model Top-1 attention Top-5 attention Top-10 attention
UpDn baseline 14.50 27.31 35.35
UpDn + GAP 16.76 29.32 36.53
Table 5: Grounding scores of UpDn before and after applying GAP on the GQA validation split.
Quantitative results

We quantify the visual attention of the UpDn model before and after applying GAP on the GQA validation set. In particular, we use the grounding score proposed by (hudson2019gqa) to measure the correctness of the model's attention weights against the provided ground-truth grounding. Results are shown in Table 5. Our method improves the grounding scores of UpDn by 2.26 points (16.76 vs. 14.50) for top-1 attention, 2.01 points (29.32 vs. 27.31) for top-5 attention and 1.18 points (36.53 vs. 35.35) for top-10 attention. Note that while the grounding scores reported by (hudson2019gqa) sum over all object regions, we report the grounding scores attributed to the top-K attentions to better emphasize how attention shifts towards the most relevant objects. This analysis complements the VQA performance in Table 3, providing a more definitive confirmation of the role of GAP in improving both reasoning attention and VQA accuracy.
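A minimal sketch of a top-K grounding score of this flavor: the attention mass that the K most-attended regions place on ground-truth relevant regions. The function name and sample numbers are hypothetical, and the official GQA metric differs in that it sums over all regions:

```python
import numpy as np

def topk_grounding_score(attention, gt_regions, k):
    """Attention mass assigned to ground-truth regions,
    restricted to the top-k attended regions (illustrative sketch)."""
    topk = np.argsort(attention)[::-1][:k]
    return sum(attention[i] for i in topk if i in gt_regions)

att = np.array([0.05, 0.40, 0.30, 0.15, 0.10])   # attention over 5 regions
score_top1 = topk_grounding_score(att, gt_regions={1, 2}, k=1)
score_top5 = topk_grounding_score(att, gt_regions={1, 2}, k=5)
```

Here the top-1 score counts only region 1 (the most attended), while the top-5 score additionally credits region 2.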

Figure 6: Qualitative analysis of GAP. (a) Region-word alignments of different RE-image pairs learned by our unsupervised grounding framework. (b) Visual attentions and prediction of UpDn model before (left) vs. after applying GAP (right). GAP shifts the model’s highest visual attention (green rectangle) to more appropriate regions while the original puts attention on irrelevant parts.
Qualitative results

We analyze the internal operation of GAP by visualizing grounding results on a sample taken from the GQA validation set. The quality of grounding is demonstrated in Fig. 6(a) with the word-region alignments found for several RE-image pairs. With GAP, this high-quality grounding ultimately benefits VQA models by guiding their visual attention. Fig. 6(b) shows the visual attention of the UpDn model before and after applying GAP. The guided attention is shifted towards more appropriate visual regions than the attention of the UpDn baseline.

6 Conclusion

We have presented a generic methodology to semantically enhance cross-modal attention in VQA. We extracted linguistic-visual associations from query-image pairs and used them to guide VQA models' attention with the Grounding-based Attention Prior (GAP). Through extensive experiments across large VQA benchmarks, we demonstrated the effectiveness of our approach in boosting the performance of attention-based VQA models and mitigating their reliance on supervised data. We also presented qualitative analyses demonstrating the benefits of leveraging grounding-based attention priors to improve the interpretability and trustworthiness of attention-based VQA models. Broadly, the capability to obtain associations between words and visual entities in the form of common knowledge is key towards systematic generalization in joint visual and language reasoning.


Appendix A Method details

A.1 Language and Visual Embedding

Textual embedding

Given a question, we first tokenize it into a sequence of words and embed each word into a 300-dimensional vector space. We initialize the word embeddings with the popular pre-trained GloVe vector representations (pennington2014glove).

To model the sequential nature of the query, we use bidirectional LSTMs (BiLSTMs) that take the word embedding vectors as input. At each time step, the BiLSTMs produce a forward hidden state and a backward hidden state. We combine each forward-backward pair into a single vector by concatenation, and the contextual word representations are obtained by gathering these combined vectors. The global representation of the query is a combination of the final states of the forward and backward passes.
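The combination of forward and backward states can be sketched as follows, with toy random matrices standing in for actual BiLSTM outputs (the sizes are illustrative assumptions):

```python
import numpy as np

T, d = 6, 4                     # question length and LSTM hidden size (toy values)
fwd = np.random.randn(T, d)     # forward hidden states h_1 ... h_T
bwd = np.random.randn(T, d)     # backward hidden states h_1 ... h_T

# contextual word representations: per-step concatenation of the two passes
ctx_words = np.concatenate([fwd, bwd], axis=1)    # shape (T, 2d)

# global query representation: concatenation of the two pass endpoints
# (the forward pass ends at t = T, the backward pass ends at t = 1)
q_global = np.concatenate([fwd[-1], bwd[0]])      # shape (2d,)
```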

For our grounding framework with contrastive learning, we use a contextualized word representation extracted by a pre-trained BERT language model (devlin2018bert) for each word in an extracted RE. These contextualized embeddings are found to be more effective for phrase grounding (gupta2020contrastive).

Visual embedding

Visual regions are extracted by the popular Faster R-CNN object detector (ren2015faster) pre-trained on Visual Genome (krishna2017visual). We use public code built on the Facebook Detectron2 v2.0.1 framework for this purpose. For each image, we extract a set of RoI pooling features with bounding boxes, i.e., the appearance features of object regions together with the bounding box coordinates. We follow (yu2017joint) to encode the bounding box coordinates into a spatial vector of 7 dimensions. We further combine the appearance features with the encoded spatial features using a sub-network of two linear transformations, obtaining a set of visual objects whose vector length is that of the joint appearance-spatial features. For ease of reading and implementation, we choose the linguistic feature size and the visual feature size to be the same.
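A sketch of the fusion sub-network, assuming 2048-d appearance features, 7-d spatial vectors, and a 512-d joint space; the layer shapes, ReLU choice, and random weights are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_app, d_spa, d = 36, 2048, 7, 512    # regions, appearance dim, spatial dim, joint dim

app = rng.standard_normal((N, d_app))    # Faster R-CNN RoI appearance features
spa = rng.standard_normal((N, d_spa))    # encoded 7-d spatial vectors

# hypothetical two-layer sub-network fusing appearance and spatial cues
W1 = rng.standard_normal((d_app + d_spa, d)) * 0.01
W2 = rng.standard_normal((d, d)) * 0.01
h = np.maximum(np.concatenate([app, spa], axis=1) @ W1, 0.0)   # linear + ReLU
objects = h @ W2                                               # (N, d) visual objects
```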

A.2 Meaning of Attention Refinement Mechanisms

In this section, we give a proof that our choices for attention refinement in Eqs. (10, 11 and 14) in the main paper are optimal solutions with respect to certain criteria for probability estimate aggregation.

Let us consider the generic problem where a system has multiple estimates p_1, ..., p_K of a true discrete distribution, produced by multiple mechanisms with corresponding degrees of certainty. We first normalize these certainty measures into weights w_1, ..., w_K so that they sum to one: Σ_i w_i = 1. We aim at finding a common distribution q that aggregates the set of distributions {p_i}, subject to an item-to-set distance D(q, {p_i}; {w_i}):

    q* = argmin_q D(q, {p_i}; {w_i}).


This problem can be solved for particular choices of the set-distance function D, which measures the discrepancy between q and the set {p_i} under the confidence weights {w_i}. We consider several heuristic choices of this function below.


Additive form:

If we define the distance as the weighted sum of squared Euclidean distances from q to each member distribution of the set, the minimized term becomes

    D(q, {p_i}; {w_i}) = Σ_i w_i ||q − p_i||².

Minimizing with respect to q, the gradient is ∇_q D = 2 Σ_i w_i (q − p_i). Setting this gradient to zero (and using Σ_i w_i = 1) yields q = Σ_i w_i p_i. This explains the additive form of our attention refinement mechanism in Eqs. (10 and 14 (upper part)) in the main paper, where we seek a solution that best agrees with both the grounding prior and the model-induced probability.
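A numeric check of the additive form, with two hypothetical estimates (a grounding prior and a model-induced attention): the weighted arithmetic mean is itself a valid distribution.

```python
import numpy as np

p = np.array([[0.7, 0.2, 0.1],    # grounding prior over 3 regions
              [0.2, 0.3, 0.5]])   # model-induced attention
w = np.array([0.4, 0.6])          # normalized certainties, sum to 1

# weighted arithmetic mean q = sum_i w_i p_i minimizes sum_i w_i ||q - p_i||^2
q_add = w @ p
```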

Multiplicative form:

If we instead define D as the weighted sum of the KL divergences from q to each member distribution of the set:

    D(q, {p_i}; {w_i}) = Σ_i w_i KL(q || p_i),

we minimize a Lagrangian with multiplier λ enforcing Σ_j q_j = 1:

    L(q, λ) = Σ_i w_i Σ_j q_j log(q_j / p_{i,j}) + λ (Σ_j q_j − 1).

Its gradient with respect to q_j is Σ_i w_i (log(q_j / p_{i,j}) + 1) + λ. Setting this gradient to zero leads to

    q_j = (1/Z) Π_i p_{i,j}^{w_i},

where Z is a calculable constant normalizing q so that its components sum to one. This explains the multiplicative form of our attention refinement mechanism in Eqs. (11 and 14 (lower part)) in the main paper.
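A numeric check of the multiplicative form on the same two hypothetical distributions: the renormalized weighted geometric mean again yields a valid distribution, here pulled toward the higher-certainty prior.

```python
import numpy as np

p = np.array([[0.7, 0.2, 0.1],    # grounding prior over 3 regions
              [0.2, 0.3, 0.5]])   # model-induced attention
w = np.array([0.4, 0.6])          # normalized certainties, sum to 1

# weighted geometric mean, renormalized: q_j proportional to prod_i p_ij^{w_i}
q_mul = np.prod(p ** w[:, None], axis=0)
q_mul /= q_mul.sum()
```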

A.3 Neural Gating Functions

Here we provide the details of the implementation choices for the neural gating functions in Eqs. (12 and 13). In particular, we use the element-wise product between embedded representations of the two inputs, where the W are learnable weights, the b are biases, σ is the sigmoid function, ⊙ denotes the Hadamard product, and ELU (clevert2015fast) is a non-linear activation function.
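A sketch of a gating function of this shape: a sigmoid applied over the Hadamard product of embedded inputs. The weight initialization, toy feature size, and exact layer arrangement are hypothetical, not the paper's parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

rng = np.random.default_rng(1)
d = 8                                                    # toy feature size
x, c = rng.standard_normal(d), rng.standard_normal(d)    # two input representations

# hypothetical gate: sigmoid over the Hadamard product of ELU-embedded inputs
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
b1, b2 = np.zeros(d), np.zeros(d)
w_g, b_g = rng.standard_normal(d), 0.0

gate = sigmoid(w_g @ (elu(W1 @ x + b1) * elu(W2 @ c + b2)) + b_g)   # scalar in (0, 1)
```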

For multi-step reasoning, we additionally take as input the intermediate controlling signal at each reasoning step. The output of the modulating gate in Eq. 13 in the main paper takes a similar gating form, with its own learnable weights and biases.

Appendix B Experiment details

B.1 Datasets

VQA v2

is a large-scale VQA dataset entirely based on human annotation and the most popular benchmark for VQA models. It contains 1.1M questions with more than 11M answers annotated over 200K MSCOCO images (lin2014microsoft), of which 443,757, 214,354 and 447,793 questions are in the train, val and test splits, respectively.

We keep correct answers appearing more than 8 times in the training split, similar to prior works (teney2018tips; anderson2018bottom). We report performance using the standard VQA accuracy metric (antol2015vqa), under which a predicted answer scores min(#humans who provided that answer / 3, 1).
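The standard VQA accuracy metric is commonly implemented as follows (`vqa_accuracy` is our illustrative name; the official evaluation additionally averages over subsets of the ten human answers):

```python
def vqa_accuracy(predicted, human_answers):
    """Standard VQA accuracy: an answer is fully correct if at least
    3 of the 10 human annotators gave it, partially correct otherwise."""
    return min(human_answers.count(predicted) / 3.0, 1.0)

vqa_accuracy("blue", ["blue"] * 9 + ["navy"])   # full credit
vqa_accuracy("navy", ["blue"] * 9 + ["navy"])   # partial credit (1/3)
```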


GQA is currently the largest VQA dataset. It contains over 22M question-answer pairs and over 113K images covering various reasoning skills and requiring multi-step inference, hence significantly reducing the biases present in previous VQA datasets. Each question is generated from an associated scene graph and pre-defined structural patterns. GQA has served as a standard benchmark for most advanced compositional visual reasoning models (hudson2019gqa; hu2019language; hudson2019learning; shevchenko2020visual). We use the balanced splits of the dataset in our experiments.

B.2 Baseline Models

Bottom-Up Top-Down Attention (UpDn)

UpDn is the first model to introduce a bottom-up attention mechanism to VQA, utilizing image region features extracted by Faster R-CNN (ren2015faster) pre-trained on the Visual Genome dataset (krishna2017visual). A top-down attention network driven by the question representation summarizes the image region features to retrieve relevant information that can be decoded into an answer. The UpDn model won the VQA Challenge in 2017 and has been a standard VQA baseline since.


MACNet is a multi-step co-attention based model that performs sequential reasoning, with VQA used as a testbed. Given a set of contextual word embeddings and a set of visual region features, at each time step an MAC cell learns the interactions between the two sets, taking into account their past interactions at previous time steps through a memory. In particular, an MAC cell uses a controller to first compute a controlling signal by summarizing the contextual embeddings of the query words with an attention mechanism. The controlling signal is then coupled with the memory state of the previous reasoning step to drive the computation of the intermediate visual attention scores. At the end of a reasoning step, the retrieved visual feature is used to update the memory state of the reasoning process. The process is repeated over multiple steps, resembling the way humans reason over a compositional query. In our experiments, we use a PyTorch equivalent implementation of MACNet instead of the original TensorFlow-based implementation, with the same number of reasoning steps in all experiments. For experiments with GAP, we only refine the attention weights inside the controller (linguistic attention) and the read module (visual attention) at the first reasoning step, where the grounding prior shows its best effect in accelerating the learning of attention weights and hence leads to the best overall performance.

Bilinear Attention Networks (BAN)

BAN is one of the most advanced VQA models based on low-rank bilinear pooling. Given two input channels (language and vision in the VQA setting), BAN uses low-rank bilinear pooling to extract the pair-wise cross-modal interactions between the elements of the inputs. It then produces an attention map to selectively attend to the pairs most relevant to the answer. BAN also takes advantage of multimodal residual networks to improve its performance by repeatedly refining the retrieved information over multiple attention maps. We use its official implementation in our experiments. To best judge the model's performance with our attention refinement with grounding priors, we remove the plug-and-play counting module (zhang2018learning) from the original implementation.
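The low-rank bilinear interaction at BAN's core can be sketched as follows; the toy dimensions, random weights, and global softmax over all word-region pairs are illustrative assumptions, not BAN's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, d, r = 5, 8, 16, 4            # words, regions, feature dim, rank (toy values)
X = rng.standard_normal((T, d))     # word features
Y = rng.standard_normal((N, d))     # region features

# low-rank bilinear interaction: logits[t, n] = (U^T x_t) . (V^T y_n)
U, V = rng.standard_normal((d, r)), rng.standard_normal((d, r))
logits = (X @ U) @ (Y @ V).T        # (T, N) pairwise cross-modal interactions

# attention map over all word-region pairs
A = np.exp(logits - logits.max())
A /= A.sum()
```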

Regarding the choice of hyper-parameters, all experiments regardless of the baselines use a feature size of 512. The number of visual objects per image is fixed, and the maximum number of words in a query is set to the length of the longest query in the respective dataset. We train all models using the Adam optimizer, with the learning rate scheduled using a warm-up strategy, similar to prior works in VQA (jiang2018pythia). Reported results are from the epoch that gives the best accuracy on the validation sets.

B.3 Additional Experimental Results

Apart from the experimental results in Sec. 5 of the main paper, we provide additional results on the VQA-CP2 dataset (agrawal2018don) to support our claim that GAP complements related regularization schemes. We choose RUBi (cadene2019rubi) as a representative bias-reduction method for VQA, a general yet effective linguistic debiasing technique on the VQA-CP2 dataset. Table 6 presents our experimental results with the UpDn baseline. Even though linguistic biases are not its main target, GAP shows consistent improvements on top of both the UpDn baseline and the UpDn+RUBi baseline. We emphasize that applying RUBi's regularization for linguistic bias considerably hurts performance on VQA v2, even though RUBi largely improves performance on the VQA-CP2 test split. GAP brings the benefits of pre-computed attention priors while avoiding the damage caused by RUBi's regularization, maintaining its excellent performance on VQA v2 while slightly improving the baseline's performance on VQA-CP2. Looking more closely at the results per question type on VQA-CP2 (Row 1 vs. Row 2, and Row 3 vs. Row 4), GAP shows a universal effect on all question types, with the strongest effect on the "Other" type, which contains open-ended arbitrary questions. On the other hand, RUBi (Row 3 vs. Row 1) has a significant impact only on binary "Yes/No" questions but considerably hurts the "Number" and especially "Other" types. This reveals that the regularization scheme in RUBi overfits to "Yes/No" questions, a consequence of the limitation of the data generation process behind this dataset.

The analysis in this section is consistent with our results in Figure 4 of the main paper and clearly evidences GAP's universal effect in improving VQA performance. The additional results with RUBi also show GAP's complementary benefits when combined with learning regularization methods that target only a specific type of data, as in VQA-CP2.

Model VQA-CP2 test VQA v2 val
Overall Yes/No Number Other Overall Yes/No Number Other
UpDn baseline 40.6 41.2 13.0 48.1 63.3 79.7 42.8 56.4
UpDn+GAP 40.8 41.2 13.2 48.3 64.3 81.2 44.1 56.9
UpDn+RUBi 48.6 72.1 12.6 46.1 62.7 79.2 42.8 55.5
UpDn+RUBi+GAP 48.9 72.2 12.8 46.4 64.2 81.4 44.3 56.3
Table 6: Performance on VQA v2 val split and VQA-CP2 test split with UpDn baseline.

B.4 Additional Qualitative Analysis

Figure 7: Qualitative analysis of GAP with UpDn baseline. (a) Region-word alignments of different RE-image pairs learned by our unsupervised grounding framework. (b) Visual attentions and prediction of UpDn model before (left) vs. after applying GAP (right). GAP shifts the model’s highest visual attention (green rectangle) to more appropriate regions while the original puts attention on irrelevant parts.
Figure 8: Qualitative analysis of GAP with MACNet baseline. (a) Region-word alignments of different RE-image pairs learned by our unsupervised grounding framework. (b) Visual attentions and prediction of MACNet model before (left) vs. after applying GAP (right). Visualized attention weights are obtained at the last reasoning step of MACNet.

Fig. 6 in the main paper provides one visualization of the internal operation of our proposed method GAP, as well as its effect on VQA models. We provide more examples here for the UpDn baseline (Fig. 7) and the MACNet baseline (Fig. 8) with the same conventions and legends.

In each figure, the left subfigures present the linguistic-visual alignments learned by our unsupervised grounding framework, and the right subfigures compare the visual attention before and after applying GAP. In all cases across the two baselines (UpDn and MACNet), GAP clearly helps direct the models to attend to more appropriate visual regions, partly explaining their answer predictions.