Multimodal Unified Attention Networks for Vision-and-Language Interactions

08/12/2019
by Zhou Yu, et al.

Learning an effective attention mechanism for multimodal data is important in many vision-and-language tasks that require a synergistic understanding of both visual and textual content. Existing state-of-the-art approaches use co-attention models to associate each visual object (e.g., image region) with each textual object (e.g., query word). Despite their success, these co-attention models capture only inter-modal interactions while neglecting intra-modal interactions. Here we propose a general "unified attention" model that simultaneously captures the intra- and inter-modal interactions of multimodal features and outputs their corresponding attended representations. By stacking such unified attention blocks in depth, we obtain the deep Multimodal Unified Attention Network (MUAN), which can be applied seamlessly to the visual question answering (VQA) and visual grounding tasks. We evaluate our MUAN models on two VQA datasets and three visual grounding datasets, and the results show that MUAN achieves top-level performance on both tasks without bells and whistles.
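To make the core idea concrete: applying self-attention to the concatenation of the visual and textual feature sequences produces a single attention map whose quadrants cover image-image, word-word, and image-word interactions at once. Below is a minimal PyTorch sketch of that idea; the class name, dimensions, depth, and plain transformer-style block structure are illustrative assumptions, and the paper's exact unified attention block differs in details not reproduced here.

```python
import torch
import torch.nn as nn

class UnifiedAttentionBlock(nn.Module):
    """Sketch of one unified attention block (not the paper's exact design).

    Self-attention over the concatenated visual+textual sequence lets every
    feature attend to every other feature, so a single attention map captures
    intra-modal (image-image, word-word) and inter-modal (image-word)
    interactions simultaneously.
    """

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Hypothetical hyperparameters chosen for illustration only.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, num_regions + num_words, dim] fused feature sequence
        x = self.norm1(x + self.attn(x, x, x)[0])  # unified attention
        x = self.norm2(x + self.ffn(x))            # position-wise FFN
        return x

# Stacking such blocks in depth, as the abstract describes for MUAN:
visual = torch.randn(2, 36, 512)   # e.g., 36 image-region features
textual = torch.randn(2, 14, 512)  # e.g., 14 query-word features
fused = torch.cat([visual, textual], dim=1)
for block in [UnifiedAttentionBlock() for _ in range(6)]:
    fused = block(fused)
# The attended visual and textual representations can be sliced back out
# of `fused` for task-specific heads (e.g., VQA answer classification).
```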


Related research

09/27/2021
Multimodal Integration of Human-Like Attention in Visual Question Answering
Human-like attention as a supervisory signal to guide neural attention h...

11/04/2020
An Improved Attention for Visual Question Answering
We consider the problem of Visual Question Answering (VQA). Given an ima...

08/17/2022
Understanding Attention for Vision-and-Language Tasks
Attention mechanism has been used as an important component across Visio...

06/25/2023
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
The ability to model intra-modal and inter-modal interactions is fundame...

02/04/2019
Embodied Multimodal Multitask Learning
Recent efforts on training visual navigation agents conditioned on langu...

08/24/2018
A Visual Attention Grounding Neural Model for Multimodal Machine Translation
We introduce a novel multimodal machine translation model that utilizes ...

10/05/2020
Attention Guided Semantic Relationship Parsing for Visual Question Answering
Humans explain inter-object relationships with semantic labels that demo...
