VQA with Cascade of Self- and Co-Attention Blocks

02/28/2023
by Aakansha Mishra, et al.

The use of complex attention modules has improved the performance of the Visual Question Answering (VQA) task. This work aims to learn an improved multi-modal representation through dense interaction of the visual and textual modalities. The proposed model has an attention block containing both self-attention and co-attention on image and text. The self-attention modules provide the contextual information of objects (for an image) and words (for a question) that is crucial for inferring an answer. Co-attention, in turn, mediates the interaction between image and text. Further, fine-grained information is obtained from the two modalities by using a Cascade of Self- and Co-Attention blocks (CSCA). The proposed model is benchmarked on the widely used VQA2.0 and TDIUC datasets. Ablation experiments demonstrate the efficacy of the model's key components and of cascading the attention blocks.
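To make the block structure concrete, here is a minimal single-head sketch of the self-attention followed by co-attention pattern the abstract describes, cascaded over several blocks. This is an illustrative reconstruction, not the authors' implementation: the real model would use learned query/key/value projections, multi-head attention, and feed-forward layers, and the feature sizes below (36 image regions, 14 question tokens, 64 dimensions) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention: each query row attends over the keys,
    # producing a weighted sum of the value rows.
    d = q.shape[-1]
    w = softmax(q @ k.T / np.sqrt(d))
    return w @ v

def csca_block(img, txt):
    # Self-attention: contextualize objects among objects, words among words.
    img_ctx = attention(img, img, img)
    txt_ctx = attention(txt, txt, txt)
    # Co-attention: each modality attends to the other.
    img_out = attention(img_ctx, txt_ctx, txt_ctx)
    txt_out = attention(txt_ctx, img_ctx, img_ctx)
    return img_out, txt_out

def cascade(img, txt, depth=3):
    # Stacking blocks refines the multi-modal representation.
    for _ in range(depth):
        img, txt = csca_block(img, txt)
    return img, txt

rng = np.random.default_rng(0)
img = rng.standard_normal((36, 64))  # 36 object-region features (hypothetical)
txt = rng.standard_normal((14, 64))  # 14 question-word features (hypothetical)
img_out, txt_out = cascade(img, txt)
print(img_out.shape, txt_out.shape)
```

Because every attention step returns one output row per query row, the cascade preserves the shapes of both modality streams, so blocks can be stacked to any depth before the fused features are pooled for answer prediction.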


