Guiding Visual Question Answering with Attention Priors

by Thao Minh Le et al.
Deakin University

The current success of modern visual reasoning systems is arguably attributed to cross-modality attention mechanisms. However, in deliberative reasoning such as in VQA, attention is unconstrained at each step, and thus may serve as a statistical pooling mechanism rather than a semantic operation intended to select information relevant to inference. This is because at training time, attention is only guided by a very sparse signal (i.e. the answer label) at the end of the inference chain. This causes the cross-modality attention weights to deviate from the desired visual-language bindings. To rectify this deviation, we propose to guide the attention mechanism using explicit linguistic-visual grounding. This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects. Here we learn the grounding from the pairing of questions and images alone, without the need for answer annotation or external grounding supervision. This grounding guides the attention mechanism inside VQA models through a duality of mechanisms: pre-training attention weight calculation and directly guiding the weights at inference time on a case-by-case basis. The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process. This scalable enhancement improves the performance of VQA models, fortifies their robustness to limited access to supervised data, and increases interpretability.



1 Introduction

Visual reasoning is the new frontier of AI wherein facts extracted from visual data are gathered and distilled into higher-level knowledge in response to a query. Successful visual reasoning methodology estimates the cross-domain association between symbolic concepts and visual entities in the form of attention weights. Such associations shape the knowledge distillation process, resulting in a unified representation that can be decoded into an answer. In the exemplar reasoning setting known as Visual Question Answering (VQA), attention plays a pivotal role in modern systems (anderson2018bottom; hudson2018compositional; kim2018bilinear; le2020dynamic; lu2016hierarchical). Ideal attention scores must be both relevant and effective: relevance implies that attention is high when the visual entity and the linguistic entity refer to the same concept; effectiveness implies that the attention derived leads to good VQA performance.

However, in typical systems, the attention scores are computed on-the-fly: unregulated at inference time and guided at training time only by the gradient from the groundtruth answers. Analysis of several VQA attention models shows that these attention scores are usually neither relevant nor guaranteed to be effective (das2017human). The problem is even more severe when we cannot afford enough labeled answers due to the cost of the human annotation process. A promising solution is providing pre-computed guidance to direct the attention mechanisms inside VQA models towards more appropriate scores. Early works use human attention as the label for supervising machine attention (qiao2018exploring; selvaraju2019taking). However, this simple and direct attention perceived by humans is not guaranteed to be optimal for machine reasoning (firestone2020performance; fleuret2011comparing). Furthermore, because annotating attention is a complex labeling task, this process is inherently costly, inconsistent and unreliable (selvaraju2019taking). Finally, these methods only regulate the attention scores during training without directly adjusting them at inference. Different from these approaches, we leverage the fact that such external guidance pre-exists in the query-image pairs and can be extracted without any additional labels. Using pre-computed language-visual associations as an inductive bias for attention-based reasoning, without extra labeling, remains a desired but missing capability.

Figure 1: We introduce the Grounding-based Attention Prior (GAP) mechanism (blue box), which considers the linguistic-visual associations between a VQA query-image pair and refines the attention inside the reasoning model (gray box). This boosts the performance of VQA models, reduces their reliance on supervised data and increases their interpretability.

Exploring this underlying linguistic-visual association for VQA, we aim to distill the compatibility between entities across input modalities in an unsupervised manner from query-image pairs, without explicit alignment groundtruths, and to use this knowledge as an inductive bias for the attention mechanism, thus boosting reasoning capability. To this end, we design a framework called Grounding-based Attention Prior (GAP) to (1) extract the alignments between linguistic-visual region pairs and (2) use these pair-wise associations as an inductive bias to guide VQA’s attention mechanisms.

For the first task, we exploit the pairing between the questions and the images as a weakly supervised signal to learn the mapping between words and image regions. Because it exploits the implicit supervision in this pairing, the approach requires no further annotation. To overcome the challenge of disparity in the co-inferred semantics between query words and image regions, we construct a parse tree of the query, extract the nested phrasal expressions and ground them to image regions. These expressions semantically match image regions better than single words do, and thus create a set of more reliable linguistic-visual alignments.

The second task aims at using these newly discovered alignments to guide reasoning attention. This guidance is provided through two complementary pathways. First, we pre-train the attention weights to align with the pre-computed grounding. This step is done in an unsupervised manner without access to the answer groundtruths. Second, we use the attention prior to directly regulate and refine the attention weights that are guided by the groundtruth answer through back-propagation, so that they do not deviate too far from the prior. This is modulated by a learnable gate. These dual guidance pathways are a major advancement over previous attention regularization methods (selvaraju2019taking; wu2019self), as the linguistic-visual compatibility is leveraged directly and flexibly in both training and inference rather than merely as a regularizer.

Through extensive experiments, we show that this methodology is effective both in discovering the grounding and in using it to boost the performance of attention-based VQA models across representative methods and datasets. These improvements surpass the performance of competing methods and, furthermore, require no extra annotation. The proposed method also significantly improves the sample efficiency of VQA models, hence fewer annotated answers are required. Fig. 1 illustrates the intuition and design of the method with an example of the improved attention and answer.

Our key contributions are:

1. A novel framework to calculate linguistic-visual alignments, providing pre-computed attention priors to guide attention-based VQA models;

2. A generic technique to incorporate attention priors into most common visual reasoning methods, fortifying them in performance and significantly reducing their reliance on human supervision; and,

3. Rigorous experiments and analysis on the relevance of linguistic-visual alignments to reasoning attention.

2 Related Work

Attention-based models are the most prominent approaches in VQA. Simple methods (anderson2018bottom) use a single-hop attention mechanism to help the machine select relevant image features. More advanced methods (yang2016stacked; hudson2018compositional; le2020dynamic) and those relying on memory networks (xiong2016dynamic; xu2016ask) use multi-hop attention mechanisms to repeatedly revise the selection of relevant visual information. BAN (kim2018bilinear) learns a co-attention map using expensive bilinear networks to represent the interactions between word-region pairs. One drawback of these attention models is that they are supervised only by the answer groundtruth, without explicit attention supervision.

Attention supervision has recently been studied for several problems such as machine translation (liu2016neural) and image captioning (liu2017attention; ma2020learning; zhou2020more). In VQA, attention can be self-regulated through internal constraints (ramakrishnan2018overcoming; liu2021answer). More successful regularization methods use external knowledge such as human annotations of textual explanations (wu2019self) or visual attention (qiao2018exploring; selvaraju2019taking). Unlike these, we propose to supervise VQA attention using pre-computed language-visual grounding from image-query pairs without external annotation.

Linguistic-visual alignment includes the tasks of text-image matching (lee2018stacked), grounding referring expressions (yu2018mattnet) and cross-domain joint representation (lu2019vilbert; su2020vl). These groundings can support tasks such as captioning (zhou2020more; karpathy2015deep). Although most of these tasks are supervised by human annotations, contrastive learning (gupta2020contrastive; wang2021improving) allows machines to learn the associations between words and image regions from the weak supervision of phrase-image pairs. In this work, we propose to explore such associations between query and image in VQA. This is a new challenge because the query is complex and harder to ground; we therefore devise a new method that exploits grammatical structure.

Our work also shares the knowledge distillation paradigm (hinton2015distilling) with cross-task (albanie2018emotion) and cross-modality (gupta2016cross; liu2018multi; wang2020improving) adaptations. In particular, we distill visual-linguistic grounding and use it as an input for the VQA model’s attention. This also distinguishes our work from recent self-supervised pretraining methods (tan2019lxmert; li2020oscar), which focus on a unified representation for a wide variety of tasks thanks to access to enormous amounts of data. Our work is theoretically applicable to complement the multimodal matching inside these models.

3 Preliminaries

A VQA system aims to deduce an answer $a$ about an image $V$ in response to a linguistic question $q$, i.e., $a = f(q, V)$. The query is typically decomposed into a set of $S$ linguistic entities $\{l_i\}_{i=1}^{S}$. These entities and the query are then embedded into a feature vector space: $l_i, q \in \mathbb{R}^{d}$. In the case of the sequential embedding popularly used for VQA, entities are query words; they are encoded with GloVe for word-level embedding (pennington2014glove) followed by RNNs such as a BiLSTM for sentence-level embedding. Likewise, the image is often segmented into a set of $N$ visual regions with features $\{v_j\}_{j=1}^{N}$, $v_j \in \mathbb{R}^{d}$, by an object detector, e.g., Faster R-CNN (ren2015faster). For ease of reading, we use the same dimension $d$ for both linguistic embedding vectors and visual representation vectors.

A large family of VQA systems (lu2016hierarchical; anderson2018bottom; hudson2018compositional; le2020dynamic; kim2018bilinear; kim2016hadamard) rely on attention mechanisms to distribute conditional computation over the linguistic entities $\{l_i\}$ and their visual counterparts $\{v_j\}$. These models can be broadly classified into two groups: joint- and marginalized-attention models. lu2016hierarchical; anderson2018bottom and hudson2018compositional fall into the former, while kim2018bilinear; kim2016hadamard and Transformer-based models (tan2019lxmert) are typical representatives of the latter category.

Joint attention models

The most complete attention model includes a detailed pair-wise attention map indicating the contextualized correlation between word-region pairs, used to estimate the interaction between visual and linguistic entities for the combined information. These attention weights take the form of a 2D matrix $A \in \mathbb{R}^{S \times N}$, which often contains the fine-grained relationship between each linguistic word and each visual region. The attention matrix is derived by a sub-network as $A = f_{att}(\{l_i\}, \{v_j\}; \theta)$, where each $A_{ij}$ denotes the correlation between the linguistic entity $l_i$ and the visual region $v_j$, and $\theta$ is the network parameters of the VQA model. Joint attention models contain the rich pairwise relations and often perform well. However, calculating and using this full matrix incurs a large computational overhead. A good approximation of this matrix is the pair of vectors marginalized over its rows and columns, described next.

Marginalized attention models

Conceptually, the matrix $A$ is marginalized along its columns into the linguistic attention vector $\alpha \in \mathbb{R}^{S}$ and along its rows into the visual attention vector $\beta \in \mathbb{R}^{N}$. In practice, $\alpha$ and $\beta$ are calculated directly from each pair of input image and query through dedicated attention modules. They can be implemented in different ways, such as direct single-shot attention (anderson2018bottom), co-attention (lu2016hierarchical) or multi-step attention (hudson2018compositional). In our experiments, we concentrate on two popular mechanisms: single-shot attention, where the visual attention is calculated directly from the inputs, and the alternating attention mechanism, where the visual attention follows the linguistic attention (lu2016hierarchical). Concretely, $\alpha$ is estimated first, followed by the attended linguistic feature of the entire query $\bar{q} = \sum_{i} \alpha_i l_i$; this attended linguistic feature is then used to calculate the visual attention $\beta$. The alternating mechanism can be extended with multi-step reasoning (hudson2018compositional; le2020dynamic; hu2019language). In such a case, a pair of attentions $\alpha_t$ and $\beta_t$ is estimated at each reasoning step $t$, forming a series of them.
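As a concrete illustration, the alternating mechanism can be sketched as follows (a minimal numpy sketch with stand-in scoring parameters, not the authors' implementation):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def alternating_attention(L, V, w_l, w_v):
    """Alternating attention: linguistic attention first, then visual.

    L: (S, d) word embeddings; V: (N, d) region features;
    w_l, w_v: (d,) stand-ins for learned scoring sub-networks.
    """
    alpha = softmax(L @ w_l)            # linguistic attention over S words
    q_att = alpha @ L                   # attended query feature, shape (d,)
    beta = softmax(V @ (w_v * q_att))   # visual attention conditioned on q_att
    return alpha, beta

rng = np.random.default_rng(0)
S, N, d = 5, 7, 16
alpha, beta = alternating_attention(rng.normal(size=(S, d)),
                                    rng.normal(size=(N, d)),
                                    rng.normal(size=d),
                                    rng.normal(size=d))
```

Both outputs are probability distributions, over words and regions respectively; the multi-step variant simply repeats this computation with a step-dependent controlling signal.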

Figure 2: Overall architecture of a generic joint attention VQA model using Grounding-based Attention Prior (GAP) to guide the computation of attention weights. Vision-language compatibility pre-computed by an unsupervised framework (green boxes) serves as an extra source of information, providing inductive biases to guide attention weights inside attention-based VQA models towards more meaningful alignment.
Answer decoder

Attention scores drive the reasoning process, producing a joint linguistic-visual representation $z$ on which the answer is decoded: $z = f_{joint}(q, \{v_j\}, \text{att\_scores})$, where “att_scores” refers to either the visual attention vector $\beta$ or the attention matrix $A$. For marginalized attention models, the function $f_{joint}$ is a neural network taking as input the query representation $q$ and the attended visual feature $\bar{v} = \sum_{j} \beta_j v_j$ to return a joint representation. Joint attention models instead use a bilinear combination to calculate each component of the output vector of $f_{joint}$ (kim2018bilinear):

$z_k = \sum_{i=1}^{S} \sum_{j=1}^{N} A_{ij}\,\big(w_k^{\top} l_i\big)\big(u_k^{\top} v_j\big),$

where $k$ is the index of output components and $w_k$ and $u_k$ are learnable weights.
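This bilinear combination can be sketched numerically (shapes and names are illustrative; real systems use low-rank projections and nonlinearities):

```python
import numpy as np

def bilinear_joint(A, L, V, Wl, Wv):
    """z_k = sum_ij A_ij (Wl_k . l_i)(Wv_k . v_j) for k = 1..K.

    A: (S, N) joint attention; L: (S, d) word features;
    V: (N, d) region features; Wl, Wv: (K, d) learnable projections.
    """
    P = L @ Wl.T   # (S, K) projected words
    Q = V @ Wv.T   # (N, K) projected regions
    # contract over words i and regions j for every output component k
    return np.einsum('ij,ik,jk->k', A, P, Q)

rng = np.random.default_rng(1)
S, N, d, K = 4, 6, 8, 3
A = rng.normal(size=(S, N))
L = rng.normal(size=(S, d))
V = rng.normal(size=(N, d))
Wl = rng.normal(size=(K, d))
Wv = rng.normal(size=(K, d))
z = bilinear_joint(A, L, V, Wl, Wv)
```

The einsum contraction is an exact vectorization of the double sum over word-region pairs.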

4 Methods

We now present Grounding-based Attention Priors (GAP), an approach to extract the concept-level association between query and image and use this knowledge as attention priors to guide and refine the cross-modality attentions inside VQA systems. The approach consists of two main stages. First, we learn to estimate the linguistic-visual alignments directly from question-image pairs (Sec. 4.1, green boxes in Fig. 2). Second, we use such knowledge as inductive priors to assist the computation of attention in VQA (Sec. 4.2, Sec. 4.3, and lower parts in Fig. 2).

4.1 Structures for Linguistic-Visual Alignment

Grammatical structures for grounding. The task of linguistic-visual alignment aims to find the groundings between the linguistic entities (i.e., query words in VQA) and the visual entities (i.e., visual regions in VQA) in a shared context. This requires interpreting individual words in the complex context of the query so that they can co-refer to the same concepts as image regions. However, compositional queries have complex structures that prevent state-of-the-art language representation methods from fully understanding the relations between semantic concepts in the queries (reimers2019sentence). We propose to better contextualize query words by breaking a full query into phrases that refer to simpler structures, making the computation of word-region grounding more effective. These phrases are called referring expressions (REs) (mao2016generation) and were shown to co-refer well to image regions (kazemzadeh2014referitgame). The image-query pairing labels of VQA are passed down to the REs of the query. We then ground words, with contextualized embeddings within each RE, to their corresponding visual regions. As the REs are nested phrases from the query, a word can appear in multiple REs. Thus, we obtain the query-wide word-region grounding by aggregating the groundings of the REs containing the word. See Fig. 3 for an example of this process.

Figure 3: The query is parsed into a constituency parse tree to identify REs. Each RE serves as a local context for words. Words within each RE context are grounded to corresponding image regions. A word can appear in multiple REs, and thus its final grounding is averaged over containing REs, serving as inductive prior for VQA.

We extract query REs using a constituency parse tree (cirik2018using) (Berkeley Neural Parser (kitaev2018constituency) in our implementation). In this structure, the query is represented as a set of nested phrases corresponding to subtrees of the parse tree. The parser also provides the grammatical roles of the phrases. For example, the phrase “the white car” will be tagged as a noun phrase while “standing next to the white car” is a verb phrase. As visual objects and regions are naturally associated with noun phrases, we select the set of all noun phrases and wh-noun phrases (noun phrases prefixed by a wh-pronoun, e.g., “which side”, “whose bag”) as the REs.
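RE selection from a bracketed parse can be illustrated with a toy s-expression reader (standing in for the Berkeley Neural Parser, whose bracketed output we assume is available as a string):

```python
def parse(tokens):
    """Recursively read one (LABEL child ...) node from a token stream."""
    label = next(tokens)
    children = []
    while True:
        t = next(tokens)
        if t == ')':
            return (label, children)
        children.append(parse(tokens) if t == '(' else t)

def tree_from_string(s):
    tokens = iter(s.replace('(', ' ( ').replace(')', ' ) ').split())
    assert next(tokens) == '('
    return parse(tokens)

def leaves(node):
    if isinstance(node, str):
        return [node]
    return [w for c in node[1] for w in leaves(c)]

def referring_expressions(node, labels=('NP', 'WHNP')):
    """Collect the phrases of noun-phrase subtrees as candidate REs."""
    if isinstance(node, str):
        return []
    res = [' '.join(leaves(node))] if node[0] in labels else []
    for c in node[1]:
        res += referring_expressions(c)
    return res

tree = tree_from_string(
    "(SBARQ (WHNP (WDT which) (NN side)) (SQ (VBZ is) "
    "(NP (DT the) (JJ white) (NN car)) (PP (IN on))))")
res = referring_expressions(tree)
```

On this toy parse, the selected REs are “which side” (a wh-noun phrase) and “the white car” (a noun phrase), matching the selection rule described above.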

We denote the RE as $r = q_{s:e}$, where $s$ and $e$ are the start and end word indices of the RE within the query $q$. It has length $m = e - s + 1$. We now estimate the correlation between words in these REs and the visual regions by learning a neural association function $\phi(r, V; \theta_{g})$ with parameters $\theta_{g}$ that generates a mapping $G \in \mathbb{R}^{m \times N}$ between the words in the RE and the corresponding visual regions.

We implement $\phi$ as the dot products of the contextualized embedding of each word in $r$ with the representations of the regions in $V$, following the scaled dot-product attention (vaswani2017attention).
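A minimal sketch of this association function (names hypothetical):

```python
import numpy as np

def association_scores(H, V):
    """G_ij = (h_i . v_j) / sqrt(d): scaled dot-product between the
    contextualized embedding h_i of RE word i and region feature v_j."""
    d = H.shape[1]
    return H @ V.T / np.sqrt(d)

rng = np.random.default_rng(2)
H = rng.normal(size=(3, 8))   # a 3-word RE, embedding size 8
V = rng.normal(size=(5, 8))   # 5 candidate regions
G = association_scores(H, V)
```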

Unsupervised training. To train the function $\phi$, we adapt the recent contrastive learning framework (gupta2020contrastive) for phrase grounding to learn these word-region alignments from the RE-image pairs in an unsupervised manner, i.e., without explicit word-region annotations. In a mini-batch of size $B$, we calculate the positive mapping $G^{+}$ on one positive sample (the RE and the visual regions of the image paired with it) and negative mappings $G^{-}_{b}$, where $b = 1, \dots, B-1$, from negative samples (the RE and regions from images not paired with it). We then compute linguistic-induced visual representations $\hat{v}^{+}_{i}$ and $\hat{v}^{-,b}_{i}$ over regions for each word $w_{i}$:

$\hat{v}^{+}_{i} = \sum_{j=1}^{N} \mathrm{norm}\big(G^{+}\big)_{ij}\big(W v^{+}_{j} + c\big), \qquad \hat{v}^{-,b}_{i} = \sum_{j=1}^{N} \mathrm{norm}\big(G^{-}_{b}\big)_{ij}\big(W v^{-,b}_{j} + c\big),$

where “norm” is a column normalization operator; $W$ and $c$ are learnable parameters. We then contrast the positive against the negatives by maximizing the linguistic-vision InfoNCE objective (oord2018representation):

$\mathcal{L}_{NCE} = \sum_{i=1}^{m} \log \frac{\exp\big(l_{i}^{\top} \hat{v}^{+}_{i}\big)}{\exp\big(l_{i}^{\top} \hat{v}^{+}_{i}\big) + \sum_{b=1}^{B-1} \exp\big(l_{i}^{\top} \hat{v}^{-,b}_{i}\big)}.$

This objective maximizes a lower bound of the mutual information between visual regions and contextualized word embeddings (gupta2020contrastive).
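A numpy sketch of the contrastive objective under these definitions, reported as a loss to minimize (the plain dot-product similarity and softmax normalization over regions are our simplifications):

```python
import numpy as np

def norm_over_regions(G):
    """Softmax attention over regions for each RE word."""
    e = np.exp(G - G.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def info_nce_loss(H, V_pos, V_negs):
    """-log p(positive image) per RE word, averaged over the RE.

    H: (m, d) contextualized word embeddings; V_pos: (N, d) regions of
    the paired image; V_negs: list of (N, d) regions of unpaired images.
    """
    def word_scores(Vb):
        G = H @ Vb.T / np.sqrt(H.shape[1])      # association scores
        v_hat = norm_over_regions(G) @ Vb       # word-induced visual reps
        return np.einsum('id,id->i', H, v_hat)  # per-word similarity
    s_pos = word_scores(V_pos)                                    # (m,)
    s_all = np.stack([s_pos] + [word_scores(V) for V in V_negs])  # (B, m)
    m = s_all.max(axis=0)
    log_p = s_pos - m - np.log(np.exp(s_all - m).sum(axis=0))
    return -log_p.mean()

rng = np.random.default_rng(3)
loss = info_nce_loss(rng.normal(size=(4, 8)),
                     rng.normal(size=(6, 8)),
                     [rng.normal(size=(6, 8)) for _ in range(3)])
```

The loss is strictly positive whenever negatives are present, and decreases as the positive image's regions explain the RE's words better than the negatives'.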

Finally, we compute the query-wide word-region alignment $\Gamma \in \mathbb{R}^{S \times N}$ by aggregating the RE-image groundings:

$\Gamma_{ij} = \frac{1}{|\mathcal{R}_{i}|} \sum_{r \in \mathcal{R}_{i}} \big(\bar{G}_{r}\big)_{ij},$

where $\mathcal{R}_{i}$ is the set of REs containing word $w_{i}$, and $\bar{G}_{r} \in \mathbb{R}^{S \times N}$ is the zero-padded matrix of the matrix $G_{r}$, whose rows are placed at the word positions $s, \dots, e$ of $r$ within the query.
Besides making grounding more expressive, this divide-and-conquer strategy has the extra benefit of augmenting the weak supervising labels from query-image to RE-image pairs, which increase the amount of supervising signals (positive pairs) and facilitate better training of the contrastive learning framework.
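The zero-padding and per-word averaging can be sketched as follows (hypothetical helper; each RE carries its span (s, e) and its m×N score matrix):

```python
import numpy as np

def aggregate_groundings(res, S, N):
    """Average RE-level word-region scores into a query-wide (S, N) prior.

    res: list of (s, e, G) tuples with G of shape (e - s + 1, N); each
    word's row is averaged over the REs whose span [s, e] contains it.
    """
    total = np.zeros((S, N))
    count = np.zeros((S, 1))
    for s, e, G in res:
        total[s:e + 1] += G     # zero-padding: place rows at the RE's span
        count[s:e + 1] += 1
    count[count == 0] = 1       # words outside every RE keep zero rows
    return total / count

G1 = np.ones((2, 3))            # RE covering words 0..1
G2 = 3 * np.ones((2, 3))        # RE covering words 1..2
Gamma = aggregate_groundings([(0, 1, G1), (1, 2, G2)], S=4, N=3)
```

In this toy case, word 1 appears in both REs, so its scores are averaged over the two; word 3 appears in none and keeps a zero row.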

The discovered grounding provides a valuable source of priors for VQA attention. Existing works (qiao2018exploring; selvaraju2019taking) use attention priors to regulate the gradient flow of VQA models during training, hence only constraining the attention weights indirectly. Unlike these methods, we directly guide the computation of attention weights via two pathways: by pre-training them without answers, and by refining them during VQA inference on a case-by-case basis.

4.2 Pre-training VQA Attention

A typical VQA system seeks to ground linguistic concepts parsed from the question to the associated visual parts through cross-modal attention. However, this attention mechanism is guided only indirectly and distantly through the sparse training signal of the answers. This training signal is too weak to ensure that relevant associations are discovered. To directly train the attention weights to reflect these natural associations, we pre-train VQA models by enforcing the attention weights to be close to the alignment maps discovered through unsupervised grounding in Sec. 4.1.

For joint attention VQA models, this is achieved through minimizing the Kullback-Leibler divergence between vectorized forms of the VQA attention weights $A$ and the prior grounding scores $\Gamma$:

$\mathcal{L}_{pre} = D_{KL}\big(\mathrm{vec}(\Gamma)\,\big\|\,\mathrm{vec}(A)\big),$

where $\mathrm{vec}(\cdot)$ flattens a matrix into a vector, followed by a normalization operator ensuring the vector sums to one.
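A sketch of this pre-training loss (the ε smoothing is our addition for numerical safety):

```python
import numpy as np

def vec_norm(M, eps=1e-8):
    """Flatten a matrix and normalize it into a probability vector."""
    v = M.reshape(-1) + eps
    return v / v.sum()

def kl_pretrain_loss(Gamma, A):
    """KL(vec(Gamma) || vec(A)) between the grounding prior and the
    model's joint attention map."""
    p, q = vec_norm(Gamma), vec_norm(A)
    return float((p * np.log(p / q)).sum())

rng = np.random.default_rng(4)
Gamma = rng.random((3, 5))
loss_same = kl_pretrain_loss(Gamma, Gamma)       # identical maps -> 0
loss_diff = kl_pretrain_loss(Gamma, rng.random((3, 5)))
```

The loss vanishes exactly when the attention map matches the prior and is positive otherwise, so gradient descent on it pulls the model's attention toward the grounding.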

For marginalized attention VQA models, we first marginalize $\Gamma$ into a vector of visual attention priors $\gamma^{v} \in \mathbb{R}^{N}$:

$\gamma^{v}_{j} = \mathrm{norm}\Big(\sum_{i=1}^{S} \Gamma_{ij}\Big).$

The pre-training loss is the KL divergence between the attention weights and their priors:

$\mathcal{L}_{pre} = D_{KL}\big(\gamma^{v}\,\big\|\,\beta\big).$
4.3 Attention Refinement with Attention Priors

4.3.1 Marginalized attention refinement

Recall from Sec. 3 that a marginalized attention VQA model computes linguistic attention $\alpha$ over query words and visual attention $\beta$ over visual regions. In this section, we propose to directly refine these attentions using the attention priors learned in Sec. 4.1. First, $\Gamma$ is marginalized over rows and columns to obtain a pair of attention prior vectors $\gamma^{l} \in \mathbb{R}^{S}$ and $\gamma^{v} \in \mathbb{R}^{N}$:

$\gamma^{l}_{i} = \mathrm{norm}\Big(\sum_{j=1}^{N} \Gamma_{ij}\Big), \qquad \gamma^{v}_{j} = \mathrm{norm}\Big(\sum_{i=1}^{S} \Gamma_{ij}\Big).$
We then refine $\alpha$ and $\beta$ inside the reasoning process through a gating mechanism to return refined attention weights $\tilde{\alpha}$ and $\tilde{\beta}$ in two forms:

Additive form: $\tilde{\beta} = \mathrm{norm}\big((1 - g^{v})\,\beta + g^{v}\,\gamma^{v}\big)$; (10)
Multiplicative form: $\tilde{\beta} = \mathrm{norm}\big(\beta^{(1 - g^{v})} \odot (\gamma^{v})^{g^{v}}\big)$, (11)

where “norm” is a normalization operator; $g^{l}$ and $g^{v}$ are outputs of learnable gating functions that decide how much the attention priors contribute per word and per region. Intuitively, these gating mechanisms solve the problem of maximizing the agreement between the two sources of information: $\tilde{\beta} = \arg\min_{\hat{\beta}} \big[(1 - g^{v})\,D(\hat{\beta}, \beta) + g^{v}\,D(\hat{\beta}, \gamma^{v})\big]$, where $D$ measures the distance between two probability distributions. When $D$ is the Euclidean distance, it gives Eq. (10); when $D$ is the KL divergence between the two distributions, it gives Eq. (11) (heskes1998selecting) (see the Supplement for detailed proofs). The same intuition applies for the calculation of $\tilde{\alpha}$ with $g^{l}$ and $\gamma^{l}$.

The learnable gates $g^{l}$ and $g^{v}$ for $\tilde{\alpha}$ and $\tilde{\beta}$ are implemented as a neural function of the visual regions $V$ and the question $q$:

$g^{l}, g^{v} = \sigma\big(\mathrm{MLP}([\bar{v};\, q])\big), \quad (12)$

where $\sigma$ is the sigmoid function. For simplicity, $\bar{v}$ is the arithmetic mean of the regions in $V$.
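Both refinement forms and the gate can be sketched as follows (the single-layer gate is a simplification of the MLP):

```python
import numpy as np

def normalize(x, eps=1e-8):
    x = x + eps
    return x / x.sum()

def refine_additive(beta, gamma, g):
    """Eq. (10): convex combination of model attention and prior."""
    return normalize((1 - g) * beta + g * gamma)

def refine_multiplicative(beta, gamma, g):
    """Eq. (11): normalized weighted geometric mean of the two."""
    return normalize(beta ** (1 - g) * gamma ** g)

def gate(V, q, W, b=0.0):
    """Eq. (12) sketch: g = sigmoid(W [mean(V); q] + b)."""
    x = np.concatenate([V.mean(axis=0), q])
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

rng = np.random.default_rng(5)
beta = normalize(rng.random(6))     # model's visual attention
gamma = normalize(rng.random(6))    # grounding-based prior
g = gate(rng.normal(size=(4, 6)), rng.normal(size=6), rng.normal(size=12))
beta_ref = refine_additive(beta, gamma, g)
```

At g = 0 the model's own attention is kept untouched; at g = 1 it is fully replaced by the prior; the learned gate interpolates between these extremes per example.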

For multi-step reasoning, we apply Eqs. (10, 11) step-by-step. Since each reasoning step is driven by an intermediate controlling signal (Sec. 3), we adapt the gating functions to make use of that signal, replacing the whole-question representation $q$ in Eq. (12) with the step-wise controlling signal.

4.3.2 Joint attention refinement

In joint attention VQA models, we can directly use the matrix $\Gamma$ without marginalization. With a slight abuse of notation, we denote the output of the modulating gate for attention refinement as $g$, sharing a similar role with the gating mechanism in Eq. (12):

$\tilde{A} = \mathrm{norm}\big((1 - g)\,A + g\,\Gamma\big) \quad \text{or} \quad \tilde{A} = \mathrm{norm}\big(A^{(1-g)} \odot \Gamma^{g}\big),$

where $g = \sigma\big(\mathrm{MLP}([\bar{v};\, q])\big)$.

4.4 Two-stage Model Training

We perform a two-step pre-training/fine-tuning procedure to train models with the attention priors: (1) unsupervised pre-training of the VQA model without the answer decoder, using the attention priors (Sec. 4.2); and (2) fine-tuning the full VQA model with attention refinement using answer labels, i.e., by minimizing the VQA loss $\mathcal{L}_{VQA}$.
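The two-stage schedule can be illustrated on a toy, single-example problem (a stand-in gradient-descent simulation, not the actual training code): stage 1 pulls the attention toward the grounding prior without answers; stage 2 fine-tunes with an answer-driven loss.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def two_stage_train(prior, target_region, steps=200, lr=0.5):
    """Toy two-stage schedule on a single attention logit vector z.

    Stage 1 (no answers): gradient steps on cross-entropy(prior, softmax(z)),
    pulling attention toward the grounding prior.
    Stage 2 (with answer): gradient steps on -log softmax(z)[target_region],
    a stand-in for the answer-driven VQA loss.
    """
    z = np.zeros_like(prior)
    for _ in range(steps):                 # Stage 1: match the prior
        z -= lr * (softmax(z) - prior)
    for _ in range(steps):                 # Stage 2: answer supervision
        grad = softmax(z)
        grad[target_region] -= 1.0
        z -= lr * grad
    return softmax(z)

prior = np.array([0.7, 0.2, 0.1])                # grounding favors region 0
beta = two_stage_train(prior, target_region=1)   # answer needs region 1
```

Stage 1 initializes the attention at the prior, and stage 2 then moves it toward the answer-relevant region, mirroring how the answer loss can override the prior when the two disagree.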

5 Experiments

Method VQA v2 standard val
All Yes/No Num Other
UpDn+Attn. Align (selvaraju2019taking) 63.2 81.0 42.6 55.2
UpDn+AdvReg (ramakrishnan2018overcoming) 62.7 79.8 42.3 55.2
UpDn+SCR (w. ext.) (wu2019self) 62.2 78.8 41.6 54.5
UpDn+SCR (w/o ext.) (wu2019self) 62.3 77.4 40.9 56.5
UpDn+DLR (jing2020overcoming) 58.0 76.8 39.3 48.5
UpDn+RUBi (cadene2019rubi) 62.7 79.2 42.8 55.5
UpDn+HINT (selvaraju2019taking) 63.4 81.2 43.0 55.5
UpDn+GAP 64.3 81.2 44.1 56.9
Table 1: Performance comparison between GAP and other attention regularization methods using the UpDn baseline on VQA v2. Results of other methods are taken from their respective papers, except for our reproduced results.

We evaluate our approach (GAP) on two representative marginalized attention VQA models: Bottom-Up Top-Down Attention (UpDn) (anderson2018bottom) for single-shot attention and MACNet (hudson2018compositional) for multi-step compositional attention; and a joint attention model, BAN (kim2018bilinear). Experiments are on two datasets: VQA v2 (goyal2017making) and GQA (hudson2019gqa). Unless stated otherwise, we choose the additive gating (Eq. (10)) for experiments with UpDn and MACNet, and the multiplicative form (Eq. (11)) for BAN. Implementation details and extra results are provided in the supplementary materials.

5.1 Experimental Results

Figure 4: GAP’s universality across different baselines and datasets. Figure 5: GAP improves generalization capability with limited access to groundtruth answers.
Enhancing VQA performance

We compare GAP against the VQA models based on the UpDn baseline that utilize external priors and human annotation on VQA v2. Some of these methods use internal regularization: adversarial regularization (AdvReg) (ramakrishnan2018overcoming) and attention alignment (Attn. Align) (selvaraju2019taking); and some use human attention as external supervision: self-critical reasoning (SCR) (wu2019self) and HINT (selvaraju2019taking). These methods mainly aim at designing regularization schemes to exploit the underlying data generation process of the VQA-CP datasets (agrawal2018don), which deliberately build the train and test splits with different answer distributions. This potentially leads to overfitting to the particular test splits, and accuracy gains do not correlate with improvements in actual grounding (shrestha2020negative). On the contrary, GAP does not rely on those regularization schemes but aims at directly improving the learning of attention inside any attention-based VQA model to facilitate the reasoning process. In other words, GAP complements the effects of the aforementioned methods on VQA-CP (see the supplementary materials).

Table 1 shows that our approach (UpDn+GAP) clearly has advantages over other methods in improving the UpDn baseline. The favorable performance is consistent across all question types, especially on “Other” question type, which is the most important and challenging for open-ended answers (teney2021unshuffling; teney2020value).

Compared to methods using external attention annotations (UpDn+SCR, UpDn+HINT), the results suggest that GAP is an effective way to use attention priors (in both learning and inference), especially as our attention priors are extracted in an unsupervised manner without the need for human annotation.

Universality across VQA models

GAP is theoretically applicable to any attention-based VQA model. We evaluate the universality of GAP by trialing it on a wider range of baseline models and datasets. Fig. 4 summarizes the effects of GAP on UpDn, MACNet and BAN on the large-scale datasets VQA v2 and GQA.

It is clear that GAP consistently improves upon all baselines over all datasets. GAP is beneficial not only for the simple model UpDn, but also for the multi-step model MACNet, on which the strongest effect occurs when GAP is applied only in the early reasoning steps, where the attention weights are still far from convergence.

Between datasets, the improvement is stronger on GQA than on VQA v2, which is explained by the fact that GQA has a large portion of compositional questions, from which our unsupervised grounding learning can benefit.

The improvements are less significant with BAN, which already has a large-capacity attention model at the cost of data hunger and computational expense. In the next section, we show that GAP significantly reduces the amount of supervision these models need compared to the baseline.

Model R@1 R@5 R@10 Acc.
Unsupervised RE-image grounding 14.1 35.6 45.5 45.4
Unsupervised grounding w/o REs 12.0 33.0 42.9 44.3
Random alignment score (10 runs) 6.6 28.4 43.3 40.7
Table 2: Grounding performance of the unsupervised RE-image grounding when evaluated on the out-of-distribution image-caption Flickr30K Entities test set. Recall@k: fraction of phrases whose groundtruth bounding boxes have IoU ≥ 0.5 with the top-k predictions.
Sample efficient generalization

We examine the generalization of the baselines and our proposed methods by analyzing sample efficiency with respect to the number of annotated answers required. Fig. 5 shows the performance of the chosen baselines on the validation sets of VQA v2 (left column) and the GQA dataset (right column) when given different fractions of the training data. In particular, when the number of training instances with groundtruth answers is reduced to under 50% of the training set, GAP considerably outperforms all the baseline models in accuracy across all datasets by large margins. For example, when given only 10% of the training data, GAP performs better than BAN, the strongest of the chosen baselines, by over 4.1 points on VQA v2 (54.2% vs. 50.1%) and nearly 4.0 points on GQA (51.7% vs. 47.9%). The benefits of GAP are even more significant for the MACNet baseline, which easily goes off track in its early reasoning steps without large amounts of data. The results strongly demonstrate the benefits of GAP in reducing VQA models’ reliance on supervised data.

5.2 Model Analysis

Performance of unsupervised phrase-image grounding

To analyze the unsupervised grounding aspect of our model (Sec. 4.1), we test the grounding model trained on VQA v2 on a mock test set of caption-image pairs from Flickr30K Entities. This out-of-distribution evaluation setting shows whether our unsupervised grounding framework can learn meaningful linguistic-visual alignments.

The performance of our unsupervised linguistic-visual alignments using the query grammatical structure is shown in the top row of Table 2. This is compared against the alignment scores produced by the same framework but without breaking the query into REs (middle row) and random alignments (bottom row). There is an approximately 5-point accuracy gain over the random scores and over 1 point over grounding question-image pairs without phrases, indicating that our linguistic-visual alignments are a reliable inductive prior for attention in VQA.

Table 3: VQA performance on VQA v2 validation split with different sources of attention priors.
No. Models Acc.
1 UpDn baseline 63.3
2 +GAP w/ uniform-values vector 63.7
3 +GAP w/ random-values vector 63.6
4 +GAP w/ supervised grounding 64.0
5 +GAP w/ unsup. visual grounding 64.3

Table 4: Ablation studies with UpDn on VQA v2.
Models Acc.
1. UpDn baseline (own attention only) 63.3
Attention as priors
2. w/ attention prior only 60.0
Effects of the direct use of attention priors
3. +GAP w/o 1st-stage fine-tuning 63.9
4. w/ 1st-stage fine-tuning with attention priors 64.0
Effects of the gating mechanisms
5. +GAP, fixed gate 64.0
6. +GAP (multiplicative gating) 64.1
Effects of using visual-phrase associations
7. +GAP (w/o extracted phrases from questions) 63.9
8. +GAP (full model) 64.3
Effectiveness of unsupervised linguistic-visual alignments for VQA

We examine the effectiveness of our attention prior by comparing it with different ways of generating values for the visual attention prior on VQA performance. They include: (1) the UpDn baseline (no attention prior); (2) a uniform-values vector; (3) a random-values vector (normalized normal distribution); (4) supervised grounding (MAttNet (yu2018mattnet) pre-trained on RefCOCO (kazemzadeh2014referitgame)); and (5) GAP. Table 3 shows results on the UpDn baseline. GAP is significantly better than the baseline and the other attention priors (Rows 2-4). Notably, our unsupervised grounding gives better VQA performance than the supervised one (Row 5 vs. Row 4). This surprising result suggests that the pre-trained supervised model does not generalize out of distribution and is worse than grounding learned from phrase-image pairs extracted in an unsupervised manner.

Ablation studies

To provide more insight into our method, we conduct extensive ablation studies on the VQA v2 dataset (see Table 4). Throughout these experiments, we examine the role of each component toward the optimal performance of the full model. Experiments (1, 2) in Table 4 show that the UpDn model does not perform well with either only its own attention or only the attention prior. This supports our intuition that the two complement each other toward optimal reasoning. Rows 5 and 6 show that a soft combination of the two terms is necessary.
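As a toy sketch of such a soft combination, the model-induced attention can be blended with the grounding prior via a convex combination controlled by a gate. The gate value, region count, and exact parameterization below are illustrative assumptions, not the paper's formulation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def refine_attention(model_logits, prior, gate):
    """Blend model-induced attention with a grounding-based prior.

    gate in [0, 1]: 1.0 keeps only the model's own attention,
    0.0 keeps only the prior (a sketch of the additive refinement).
    """
    model_att = softmax(model_logits)
    return gate * model_att + (1.0 - gate) * prior

prior = np.array([0.7, 0.2, 0.1])    # grounding prior over 3 regions
logits = np.array([0.1, 0.1, 2.0])   # model attends to the wrong region
refined = refine_attention(logits, prior, gate=0.3)
```

With a gate of 0.3, the refined attention is pulled back toward the region favored by the grounding prior while remaining a valid distribution.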

Row 7 justifies the use of structured grounding: phrase-image grounding gives better performance than question-image pairs alone. In particular, the extracted RE-image pairs improve performance from 63.9% to 64.3%. This clearly demonstrates the significance of the grammatical structure of questions as an inductive bias for inter-modality matching, which ultimately benefits VQA.

Model Top-1 attention Top-5 attention Top-10 attention
UpDn baseline 14.50 27.31 35.35
UpDn + GAP 16.76 29.32 36.53
Table 5: Grounding scores of UpDn before and after applying GAP on the GQA validation split.
Quantitative results

We quantify the visual attention of the UpDn model before and after applying GAP on the GQA validation set. In particular, we use the grounding score proposed by (hudson2019gqa) to measure the correctness of the model's attention weights against the provided ground-truth grounding. Results are shown in Table 5. Our method improves the grounding scores of UpDn by 2.26 points (16.76 vs. 14.50) for top-1 attention, 2.01 points (29.32 vs. 27.31) for top-5 attention and 1.18 points (36.53 vs. 35.35) for top-10 attention. Note that while the grounding scores reported by (hudson2019gqa) sum over all object regions, we report the grounding scores attributed to the top-K attentions to better emphasize how attention shifts towards the most relevant objects. This analysis complements the VQA performance in Table 3, providing a more definitive confirmation of the role of GAP in improving both reasoning attention and VQA accuracy.
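A minimal sketch of a top-K grounding score of this flavor: the attention mass that the K most-attended regions place on ground-truth relevant regions. The function name and sample numbers are hypothetical, and the official GQA metric differs in that it sums over all regions:

```python
import numpy as np

def topk_grounding_score(attention, gt_regions, k):
    """Attention mass assigned to ground-truth regions,
    restricted to the top-k attended regions (illustrative sketch)."""
    topk = np.argsort(attention)[::-1][:k]
    return sum(attention[i] for i in topk if i in gt_regions)

att = np.array([0.05, 0.40, 0.30, 0.15, 0.10])   # attention over 5 regions
score_top1 = topk_grounding_score(att, gt_regions={1, 2}, k=1)
score_top5 = topk_grounding_score(att, gt_regions={1, 2}, k=5)
```

Here the top-1 score counts only region 1 (the most attended), while the top-5 score additionally credits region 2.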

Figure 6: Qualitative analysis of GAP. (a) Region-word alignments of different RE-image pairs learned by our unsupervised grounding framework. (b) Visual attentions and prediction of UpDn model before (left) vs. after applying GAP (right). GAP shifts the model’s highest visual attention (green rectangle) to more appropriate regions while the original puts attention on irrelevant parts.
Qualitative results

We analyze the internal operation of GAP by visualizing grounding results on a sample taken from the GQA validation set. The quality of grounding is demonstrated in Fig. 6(a) with the word-region alignments found for several RE-image pairs. With GAP, this high-quality grounding ultimately benefits VQA models by guiding their visual attention. Fig. 6(b) shows the visual attention of the UpDn model before and after applying GAP. The guided attention is shifted towards more appropriate visual regions than the attention of the UpDn baseline.

6 Conclusion

We have presented a generic methodology to semantically enhance cross-modal attention in VQA. We extracted linguistic-visual associations from query-image pairs and used them to guide VQA models' attention with the Grounding-based Attention Prior (GAP). Through extensive experiments across large VQA benchmarks, we demonstrated the effectiveness of our approach in boosting the performance of attention-based VQA models and mitigating their reliance on supervised data. We also presented qualitative analyses demonstrating the benefits of leveraging grounding-based attention priors to improve the interpretability and trustworthiness of attention-based VQA models. Broadly, the capability to obtain associations between words and visual entities in the form of common knowledge is key towards systematic generalization in joint visual and language reasoning.


Appendix A Method details

A.1 Language and Visual Embedding

Textual embedding

Given a question, we first tokenize it into a sequence of words and embed each word into a 300-dimensional vector space. We initialize the word embeddings with the popular pre-trained GloVe vector representations (pennington2014glove).

To model the sequential nature of the query, we use bidirectional LSTMs (BiLSTMs) that take the word embedding vectors as input. At each time step, the BiLSTMs produce a forward hidden state and a backward hidden state. We combine each forward-backward pair into a single vector by concatenation, and the contextual word representations are obtained by gathering these combined vectors. The global representation of the query is a combination of the final states of the forward and backward passes.
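The combination of forward and backward states can be sketched as follows, with toy random matrices standing in for actual BiLSTM outputs (the sizes are illustrative assumptions):

```python
import numpy as np

T, d = 6, 4                     # question length and LSTM hidden size (toy values)
fwd = np.random.randn(T, d)     # forward hidden states h_1 ... h_T
bwd = np.random.randn(T, d)     # backward hidden states h_1 ... h_T

# contextual word representations: per-step concatenation of the two passes
ctx_words = np.concatenate([fwd, bwd], axis=1)    # shape (T, 2d)

# global query representation: concatenation of the two pass endpoints
# (the forward pass ends at t = T, the backward pass ends at t = 1)
q_global = np.concatenate([fwd[-1], bwd[0]])      # shape (2d,)
```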

For our grounding framework with contrastive learning, we use a contextualized word representation extracted by a pre-trained BERT language model (devlin2018bert) for each word in an extracted RE. These contextualized embeddings are found to be more effective for phrase grounding (gupta2020contrastive).

Visual embedding

Visual regions are extracted by the popular Faster R-CNN object detector (ren2015faster) pre-trained on Visual Genome (krishna2017visual). We use public code built on the Facebook Detectron2 v2.0.1 framework for this purpose. For each image, we extract a set of RoI pooling features with bounding boxes, i.e., the appearance features of object regions together with the bounding box coordinates. We follow (yu2017joint) to encode the bounding box coordinates into a spatial vector of 7 dimensions. We further combine the appearance features with the encoded spatial features using a sub-network of two linear transformations, obtaining a set of visual objects whose vector length is that of the joint appearance-spatial features. For ease of reading and implementation, we choose the linguistic feature size and the visual feature size to be the same.
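A sketch of the fusion sub-network, assuming 2048-d appearance features, 7-d spatial vectors, and a 512-d joint space; the layer shapes, ReLU choice, and random weights are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_app, d_spa, d = 36, 2048, 7, 512    # regions, appearance dim, spatial dim, joint dim

app = rng.standard_normal((N, d_app))    # Faster R-CNN RoI appearance features
spa = rng.standard_normal((N, d_spa))    # encoded 7-d spatial vectors

# hypothetical two-layer sub-network fusing appearance and spatial cues
W1 = rng.standard_normal((d_app + d_spa, d)) * 0.01
W2 = rng.standard_normal((d, d)) * 0.01
h = np.maximum(np.concatenate([app, spa], axis=1) @ W1, 0.0)   # linear + ReLU
objects = h @ W2                                               # (N, d) visual objects
```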

A.2 Meaning of Attention Refinement Mechanisms

In this section, we give a proof that our choices for attention refinement in Eqs. (10, 11 and 14) in the main paper are optimal solutions with respect to certain criteria for probability estimate aggregation.

Let us consider the generic problem where a system has multiple estimates p_1, ..., p_K of a true discrete distribution, produced by multiple mechanisms with corresponding degrees of certainty. We first normalize these certainty measures into weights w_1, ..., w_K so that they sum to one: Σ_i w_i = 1. We aim at finding a common distribution q that aggregates the set of distributions {p_i}, subject to an item-to-set distance D(q, {p_i}; {w_i}):

    q* = argmin_q D(q, {p_i}; {w_i}).


This problem can be solved for particular choices of the set-distance function D, which measures the discrepancy between q and the set {p_i} under the confidence weights {w_i}. We consider several heuristic choices of this function below.


Additive form:

If we define the distance as the weighted sum of squared Euclidean distances from q to each member distribution of the set, the minimized term becomes

    D(q, {p_i}; {w_i}) = Σ_i w_i ||q − p_i||².

Minimizing with respect to q, the gradient is ∇_q D = 2 Σ_i w_i (q − p_i). Setting this gradient to zero (and using Σ_i w_i = 1) yields q = Σ_i w_i p_i. This explains the additive form of our attention refinement mechanism in Eqs. (10 and 14 (upper part)) in the main paper, where we seek a solution that best agrees with both the grounding prior and the model-induced probability.
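A numeric check of the additive form, with two hypothetical estimates (a grounding prior and a model-induced attention): the weighted arithmetic mean is itself a valid distribution.

```python
import numpy as np

p = np.array([[0.7, 0.2, 0.1],    # grounding prior over 3 regions
              [0.2, 0.3, 0.5]])   # model-induced attention
w = np.array([0.4, 0.6])          # normalized certainties, sum to 1

# weighted arithmetic mean q = sum_i w_i p_i minimizes sum_i w_i ||q - p_i||^2
q_add = w @ p
```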

Multiplicative form:

If we instead define D as the weighted sum of the KL divergences from q to each member distribution of the set:

    D(q, {p_i}; {w_i}) = Σ_i w_i KL(q || p_i),

we minimize a Lagrangian with multiplier λ enforcing Σ_j q_j = 1:

    L(q, λ) = Σ_i w_i Σ_j q_j log(q_j / p_{i,j}) + λ (Σ_j q_j − 1).

Its gradient with respect to q_j is Σ_i w_i (log(q_j / p_{i,j}) + 1) + λ. Setting this gradient to zero leads to

    q_j = (1/Z) Π_i p_{i,j}^{w_i},

where Z is a calculable constant normalizing q so that its components sum to one. This explains the multiplicative form of our attention refinement mechanism in Eqs. (11 and 14 (lower part)) in the main paper.
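A numeric check of the multiplicative form on the same two hypothetical distributions: the renormalized weighted geometric mean again yields a valid distribution, here pulled toward the higher-certainty prior.

```python
import numpy as np

p = np.array([[0.7, 0.2, 0.1],    # grounding prior over 3 regions
              [0.2, 0.3, 0.5]])   # model-induced attention
w = np.array([0.4, 0.6])          # normalized certainties, sum to 1

# weighted geometric mean, renormalized: q_j proportional to prod_i p_ij^{w_i}
q_mul = np.prod(p ** w[:, None], axis=0)
q_mul /= q_mul.sum()
```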

A.3 Neural Gating Functions

Here we provide the details of the implementation choices for the neural gating functions in Eqs. (12 and 13). In particular, we use the element-wise product between embedded representations of the two inputs, where the W are learnable weights, the b are biases, σ is the sigmoid function, ⊙ denotes the Hadamard product, and ELU (clevert2015fast) is a non-linear activation function.
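A sketch of a gating function of this shape: a sigmoid applied over the Hadamard product of embedded inputs. The weight initialization, toy feature size, and exact layer arrangement are hypothetical, not the paper's parameterization:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elu(z, alpha=1.0):
    return np.where(z > 0, z, alpha * (np.exp(z) - 1.0))

rng = np.random.default_rng(1)
d = 8                                                    # toy feature size
x, c = rng.standard_normal(d), rng.standard_normal(d)    # two input representations

# hypothetical gate: sigmoid over the Hadamard product of ELU-embedded inputs
W1, W2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
b1, b2 = np.zeros(d), np.zeros(d)
w_g, b_g = rng.standard_normal(d), 0.0

gate = sigmoid(w_g @ (elu(W1 @ x + b1) * elu(W2 @ c + b2)) + b_g)   # scalar in (0, 1)
```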

For multi-step reasoning, we additionally take as input the intermediate controlling signal at each reasoning step. The output of the modulating gate in Eq. 13 in the main paper takes a similar gating form, with its own learnable weights and biases.

Appendix B Experiment details

B.1 Datasets

VQA v2

is a large-scale VQA dataset entirely based on human annotation and the most popular benchmark for VQA models. It contains 1.1M questions with more than 11M answers annotated over 200K MSCOCO images (lin2014microsoft), of which 443,757, 214,354 and 447,793 questions are in the train, val and test splits, respectively.

We keep correct answers appearing more than 8 times in the training split, similar to prior works (teney2018tips; anderson2018bottom). We report performance using the standard VQA accuracy metric (antol2015vqa), under which a predicted answer scores min(#humans who provided that answer / 3, 1).
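The standard VQA accuracy metric is commonly implemented as follows (`vqa_accuracy` is our illustrative name; the official evaluation additionally averages over subsets of the ten human answers):

```python
def vqa_accuracy(predicted, human_answers):
    """Standard VQA accuracy: an answer is fully correct if at least
    3 of the 10 human annotators gave it, partially correct otherwise."""
    return min(human_answers.count(predicted) / 3.0, 1.0)

vqa_accuracy("blue", ["blue"] * 9 + ["navy"])   # full credit
vqa_accuracy("navy", ["blue"] * 9 + ["navy"])   # partial credit (1/3)
```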


GQA is currently the largest VQA dataset. It contains over 22M question-answer pairs and over 113K images covering various reasoning skills and requiring multi-step inference, hence significantly reducing the biases present in previous VQA datasets. Each question is generated from an associated scene graph and pre-defined structural patterns. GQA has served as a standard benchmark for most advanced compositional visual reasoning models (hudson2019gqa; hu2019language; hudson2019learning; shevchenko2020visual). We use the balanced splits of the dataset in our experiments.

B.2 Baseline Models

Bottom-Up Top-Down Attention (UpDn)

UpDn is the first model to introduce a bottom-up attention mechanism to VQA, utilizing image region features extracted by Faster R-CNN (ren2015faster) pre-trained on the Visual Genome dataset (krishna2017visual). A top-down attention network driven by the question representation summarizes the image region features to retrieve relevant information that can be decoded into an answer. The UpDn model won the VQA Challenge in 2017 and has been a standard VQA baseline since.


MACNet is a multi-step co-attention based model that performs sequential reasoning, with VQA used as a testbed. Given a set of contextual word embeddings and a set of visual region features, at each time step an MAC cell learns the interactions between the two sets, taking into account their past interactions at previous time steps through a memory. In particular, an MAC cell uses a controller to first compute a controlling signal by summarizing the contextual embeddings of the query words with an attention mechanism. The controlling signal is then coupled with the memory state of the previous reasoning step to drive the computation of the intermediate visual attention scores. At the end of a reasoning step, the retrieved visual feature is used to update the memory state of the reasoning process. The process is repeated over multiple steps, resembling the way humans reason over a compositional query. In our experiments, we use a PyTorch equivalent implementation of MACNet instead of the original TensorFlow-based implementation, with the same number of reasoning steps in all experiments. For experiments with GAP, we only refine the attention weights inside the controller (linguistic attention) and the read module (visual attention) at the first reasoning step, where the grounding prior shows its best effect in accelerating the learning of attention weights and hence leads to the best overall performance.

Bilinear Attention Networks (BAN)

BAN is one of the most advanced VQA models based on low-rank bilinear pooling. Given two input channels (language and vision in the VQA setting), BAN uses low-rank bilinear pooling to extract the pair-wise cross-modal interactions between the elements of the inputs. It then produces an attention map to selectively attend to the pairs most relevant to the answer. BAN also takes advantage of multimodal residual networks to improve its performance by repeatedly refining the retrieved information over multiple attention maps. We use its official implementation in our experiments. To best judge the model's performance with our attention refinement with grounding priors, we remove the plug-and-play counting module (zhang2018learning) from the original implementation.
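The low-rank bilinear interaction at BAN's core can be sketched as follows; the toy dimensions, random weights, and global softmax over all word-region pairs are illustrative assumptions, not BAN's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(2)
T, N, d, r = 5, 8, 16, 4            # words, regions, feature dim, rank (toy values)
X = rng.standard_normal((T, d))     # word features
Y = rng.standard_normal((N, d))     # region features

# low-rank bilinear interaction: logits[t, n] = (U^T x_t) . (V^T y_n)
U, V = rng.standard_normal((d, r)), rng.standard_normal((d, r))
logits = (X @ U) @ (Y @ V).T        # (T, N) pairwise cross-modal interactions

# attention map over all word-region pairs
A = np.exp(logits - logits.max())
A /= A.sum()
```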

Regarding the choice of hyper-parameters, all experiments regardless of the baselines use a feature size of 512. The number of visual objects per image is fixed, and the maximum number of words in a query is set to the length of the longest query in the respective dataset. We train all models using the Adam optimizer, with the learning rate scheduled using a warm-up strategy, similar to prior works in VQA (jiang2018pythia). Reported results are from the epoch that gives the best accuracy on the validation sets.

B.3 Additional Experimental Results

Apart from the experimental results in Sec. 5 of the main paper, we provide additional results on the VQA-CP2 dataset (agrawal2018don) to support our claim that GAP complements related regularization schemes. We choose RUBi (cadene2019rubi) as a representative bias-reduction method for VQA, a general yet effective linguistic debiasing technique on the VQA-CP2 dataset. Table 6 presents our experimental results with the UpDn baseline. Even though linguistic biases are not its main target, GAP shows consistent improvements on top of both the UpDn baseline and the UpDn+RUBi baseline. We emphasize that applying RUBi's regularization for linguistic bias considerably hurts performance on VQA v2, even though RUBi largely improves performance on the VQA-CP2 test split. GAP brings the benefits of pre-computed attention priors while avoiding the damage caused by RUBi's regularization, maintaining its excellent performance on VQA v2 while slightly improving the baseline's performance on VQA-CP2. Looking more closely at the results per question type on VQA-CP2 (Row 1 vs. Row 2, and Row 3 vs. Row 4), GAP shows a universal effect on all question types, with the strongest effect on the "Other" type, which contains open-ended arbitrary questions. On the other hand, RUBi (Row 3 vs. Row 1) has a significant impact only on binary "Yes/No" questions but considerably hurts the "Number" and especially "Other" types. This reveals that the regularization scheme in RUBi overfits to "Yes/No" questions, a consequence of the limitation of the data generation process behind this dataset.

The analysis in this section is consistent with our results in Figure 4 of the main paper and clearly evidences GAP's universal effect in improving VQA performance. The additional results with RUBi also show GAP's complementary benefits when combined with learning regularization methods that target only a specific type of data, as in VQA-CP2.

Model VQA-CP2 test VQA v2 val
Overall Yes/No Number Other Overall Yes/No Number Other
UpDn baseline 40.6 41.2 13.0 48.1 63.3 79.7 42.8 56.4
UpDn+GAP 40.8 41.2 13.2 48.3 64.3 81.2 44.1 56.9
UpDn+RUBi 48.6 72.1 12.6 46.1 62.7 79.2 42.8 55.5
UpDn+RUBi+GAP 48.9 72.2 12.8 46.4 64.2 81.4 44.3 56.3
Table 6: Performance on VQA v2 val split and VQA-CP2 test split with UpDn baseline.

B.4 Additional Qualitative Analysis

Figure 7: Qualitative analysis of GAP with UpDn baseline. (a) Region-word alignments of different RE-image pairs learned by our unsupervised grounding framework. (b) Visual attentions and prediction of UpDn model before (left) vs. after applying GAP (right). GAP shifts the model’s highest visual attention (green rectangle) to more appropriate regions while the original puts attention on irrelevant parts.
Figure 8: Qualitative analysis of GAP with MACNet baseline. (a) Region-word alignments of different RE-image pairs learned by our unsupervised grounding framework. (b) Visual attentions and prediction of MACNet model before (left) vs. after applying GAP (right). Visualized attention weights are obtained at the last reasoning step of MACNet.

Fig. 6 in the main paper provides one visualization of the internal operation of our proposed method GAP, as well as its effect on VQA models. We provide more examples here for the UpDn baseline (Fig. 7) and the MACNet baseline (Fig. 8) with the same conventions and legends.

In each figure, the left subfigures present the linguistic-visual alignments learned by our unsupervised grounding framework, and the right subfigures compare the visual attention before and after applying GAP. In all cases across the two baselines (UpDn and MACNet), GAP clearly helps direct the models to attend to more appropriate visual regions, partly explaining their answer predictions.