Multimodal Continuous Visual Attention Mechanisms

04/07/2021
by   António Farinhas, et al.

Visual attention mechanisms are a key component of neural network models for computer vision. By focusing on a discrete set of objects or image regions, these mechanisms identify the most relevant features and use them to build more powerful representations. Recently, continuous-domain alternatives to discrete attention models have been proposed, which exploit the continuity of images. These approaches model attention as simple unimodal densities (e.g. a Gaussian), making them less suitable for images whose region of interest has a complex shape or is composed of multiple non-contiguous patches. In this paper, we introduce a new continuous attention mechanism that produces multimodal densities, in the form of mixtures of Gaussians. We use the EM algorithm to obtain a clustering of relevant regions in the image, and a description length penalty to select the number of components in the mixture. Our densities decompose as a linear combination of unimodal attention mechanisms, enabling closed-form Jacobians for the backpropagation step. Experiments on visual question answering on the VQA-v2 dataset show competitive accuracies and a selection of regions that mimics human attention more closely on VQA-HAT. We present several examples that suggest that multimodal attention maps are naturally more interpretable than their unimodal counterparts, showing the ability of our model to automatically segregate objects from ground in complex scenes.
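
The two steps named in the abstract, EM clustering of relevant regions and a description length penalty for choosing the number of components, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes attention is given as non-negative weights over a grid of 2D image locations, fits the mixture by weighted EM, and uses a BIC-style penalty as a stand-in for the paper's description length criterion, whose exact form is not given here. The function names, the regularizer `reg`, and the use of the number of grid points as the sample size are all hypothetical choices.

```python
import numpy as np

def _logsumexp(a):
    # Numerically stable log-sum-exp over the component axis.
    m = a.max(axis=1)
    return m + np.log(np.exp(a - m[:, None]).sum(axis=1))

def _log_joint(X, pi, mu, cov):
    # log(pi_k) + log N(x_n; mu_k, cov_k) for every point n and component k.
    N, d = X.shape
    out = np.empty((N, len(pi)))
    for k in range(len(pi)):
        diff = X - mu[k]
        maha = np.sum(diff * np.linalg.solve(cov[k], diff.T).T, axis=1)
        _, logdet = np.linalg.slogdet(cov[k])
        out[:, k] = np.log(pi[k]) - 0.5 * (maha + logdet + d * np.log(2 * np.pi))
    return out

def weighted_em(X, w, K, n_iter=100, reg=1e-4, seed=0):
    """Fit a K-component Gaussian mixture to locations X (N, d) with weights w (N,)."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Assumption: initialise means at locations sampled in proportion to attention.
    mu = X[rng.choice(N, size=K, replace=False, p=w)]
    cov = np.stack([np.cov(X.T, aweights=w, bias=True) + reg * np.eye(d)] * K)
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: responsibilities of each component for each location.
        lj = _log_joint(X, pi, mu, cov)
        r = np.exp(lj - _logsumexp(lj)[:, None])
        # M-step: attention-weighted updates of weights, means, covariances.
        rw = r * w[:, None]
        Nk = rw.sum(axis=0) + 1e-12
        pi = Nk / Nk.sum()
        mu = (rw.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            cov[k] = (rw[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(d)
    # Attention-weighted average log-likelihood of the fitted mixture.
    ll = float(w @ _logsumexp(_log_joint(X, pi, mu, cov)))
    return pi, mu, cov, ll

def select_by_description_length(X, w, candidates=(1, 2, 3)):
    """Pick K by a BIC-style penalty (an assumed stand-in for the paper's criterion)."""
    n, d = X.shape
    best = None
    for K in candidates:
        pi, mu, cov, ll = weighted_em(X, w, K)
        # Free parameters: K means, K symmetric covariances, K - 1 mixing weights.
        n_params = K * (d + d * (d + 1) // 2) + (K - 1)
        dl = -n * ll + 0.5 * n_params * np.log(n)
        if best is None or dl < best[0]:
            best = (dl, K, pi, mu, cov)
    return best[1:]

# Toy attention map with two separated regions of interest on a 20 x 20 grid.
g = np.linspace(0.0, 1.0, 20)
X = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
bump = lambda c, s: np.exp(-((X - np.array(c)) ** 2).sum(axis=1) / (2 * s ** 2))
w = 0.6 * bump([0.25, 0.30], 0.08) + 0.4 * bump([0.80, 0.60], 0.08)
w /= w.sum()

K, pi, mu, cov = select_by_description_length(X, w)
print("selected K:", K)
print("mixing weights:", pi)
```

Each fitted component is itself a unimodal Gaussian attention density, so the mixture is a linear combination of unimodal mechanisms weighted by `pi`, which is the decomposition the abstract relies on for closed-form Jacobians.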

research
08/09/2019

Question-Agnostic Attention for Visual Question Answering

Visual Question Answering (VQA) models employ attention mechanisms to di...
research
09/27/2021

VQA-MHUG: A Gaze Dataset to Study Multimodal Neural Attention in Visual Question Answering

We present VQA-MHUG - a novel 49-participant dataset of multimodal human...
research
11/18/2017

Co-attending Free-form Regions and Detections with Multi-modal Multiplicative Feature Embedding for Visual Question Answering

Recently, the Visual Question Answering (VQA) task has gained increasing...
research
08/04/2021

Sparse Continuous Distributions and Fenchel-Young Losses

Exponential families are widely used in machine learning; they include m...
research
02/13/2020

Sparse and Structured Visual Attention

Visual attention mechanisms are widely used in multimodal tasks, such as...
research
06/12/2020

Sparse and Continuous Attention Mechanisms

Exponential families are widely used in machine learning; they include m...
research
08/01/2018

Learning Visual Question Answering by Bootstrapping Hard Attention

Attention mechanisms in biological perception are thought to select subs...
