Understanding Attention for Vision-and-Language Tasks

08/17/2022
by Feiqi Cao, et al.

The attention mechanism has been used as an important component across Vision-and-Language (VL) tasks to bridge the semantic gap between visual and textual features. While attention has been widely used in VL tasks, the capability of different attention alignment calculations to bridge the semantic gap between visual and textual clues has not been examined. In this research, we conduct a comprehensive analysis of the role of attention alignment by looking into the attention score calculation methods and checking how well they actually represent the significance of visual regions and textual tokens for the global assessment. We also analyse the conditions under which an attention score calculation mechanism would be more (or less) interpretable, and which may impact model performance on three different VL tasks: visual question answering, text-to-image generation, and text-and-image matching (both sentence and image retrieval). Our analysis is the first of its kind and provides useful insights into the importance of each attention alignment score calculation when applied at the training phase of VL tasks, which is commonly ignored in attention-based cross-modal models and/or pretrained models.
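
To make the notion of an "attention alignment score calculation" concrete, the following is a minimal sketch of how alignment scores between textual tokens and visual regions might be computed and turned into attention weights. The abstract does not list the scoring methods that were compared, so the three variants shown here (dot product, scaled dot product, cosine similarity) are assumptions drawn from common practice, and the function names `alignment_scores` and `attend` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def alignment_scores(text, regions, method="dot"):
    """Score every (token, region) pair.

    text:    array of shape (T, d), one row per textual token
    regions: array of shape (R, d), one row per visual region
    returns: array of shape (T, R)

    The three methods are illustrative assumptions, not necessarily
    the ones analysed in the paper.
    """
    if method == "dot":
        return text @ regions.T
    if method == "scaled_dot":
        # Divide by sqrt(d) to keep score magnitudes stable as d grows.
        return (text @ regions.T) / np.sqrt(text.shape[-1])
    if method == "cosine":
        t = text / np.linalg.norm(text, axis=-1, keepdims=True)
        r = regions / np.linalg.norm(regions, axis=-1, keepdims=True)
        return t @ r.T
    raise ValueError(f"unknown method: {method}")


def attend(text, regions, method="dot"):
    """Return attention weights (T, R) and attended region features (T, d)."""
    weights = softmax(alignment_scores(text, regions, method), axis=-1)
    return weights, weights @ regions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.normal(size=(3, 8))   # 3 tokens, feature dimension 8
    boxes = rng.normal(size=(4, 8))    # 4 image regions, same dimension
    for name in ("dot", "scaled_dot", "cosine"):
        w, _ = attend(tokens, boxes, method=name)
        print(name, w.round(3))
```

Each row of the returned weight matrix sums to one, so it can be read as how strongly one token attends to each region. Swapping the `method` argument changes only the score calculation, which is the component whose effect on interpretability and task performance the analysis above is concerned with.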


