A Novel Attention-based Aggregation Function to Combine Vision and Language

04/27/2020
by   Matteo Stefanini, et al.

The joint understanding of vision and language has recently been gaining attention in both the Computer Vision and Natural Language Processing communities, with the emergence of tasks such as image captioning, image-text matching, and visual question answering. As both images and text can be encoded as sets or sequences of elements – like regions and words – proper reduction functions are needed to transform a set of encoded elements into a single response, such as a classification or similarity score. In this paper, we propose a novel fully-attentive reduction method for vision and language. Specifically, our approach computes a set of scores for each element of each modality using a novel variant of cross-attention, and performs a learnable, cross-modal reduction that can be used for both classification and ranking. We test our approach on image-text matching and visual question answering, building fair comparisons with other reduction choices on the COCO and VQA 2.0 datasets. Experimentally, we demonstrate that our approach leads to a performance increase on both tasks. Further, we conduct ablation studies to validate the role of each component of the approach.
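The core idea the abstract describes, scoring each encoded element via cross-attention against the other modality and then collapsing the set into a single vector by a learnable weighted sum, can be illustrated with a minimal sketch. This is not the authors' exact formulation; the function names, the single projection matrix `W`, and the choice of a simple dot-product score are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attentive_reduction(elements, query, W):
    """Reduce a set of encoded elements to one vector.

    elements: (n, d) array, e.g. region or word features of one modality.
    query:    (d,) summary vector from the other modality (cross-attention).
    W:        (d, d) learnable projection (hypothetical parameterization).
    """
    scores = elements @ W @ query          # one relevance score per element
    weights = softmax(scores)              # normalize scores to a distribution
    return weights @ elements              # weighted sum -> single (d,) vector
```

A classification head or a similarity score can then be computed on the reduced vector, which is what makes the reduction usable for both ranking (image-text matching) and classification (VQA).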


Related research

07/07/2021 · MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering
Medical Visual Question Answering (VQA) is a multi-modal challenging tas...

05/29/2019 · Vision-to-Language Tasks Based on Attributes and Attention Mechanism
Vision-to-language tasks aim to integrate computer vision and natural la...

06/12/2018 · iParaphrasing: Extracting Visually Grounded Paraphrases via an Image
A paraphrase is a restatement of the meaning of a text in other words. P...

08/17/2022 · Understanding Attention for Vision-and-Language Tasks
Attention mechanism has been used as an important component across Visio...

08/20/2023 · Generic Attention-model Explainability by Weighted Relevance Accumulation
Attention-based transformer models have achieved remarkable progress in ...

02/28/2020 · Exploring and Distilling Cross-Modal Information for Image Captioning
Recently, attention-based encoder-decoder models have been used extensiv...

08/28/2019 · Adversarial Representation Learning for Text-to-Image Matching
For many computer vision applications such as image captioning, visual q...
