Answer-checking in Context: A Multi-modal Fully Attention Network for Visual Question Answering

10/17/2020
by Hantao Huang et al.

Visual Question Answering (VQA) is challenging due to complex cross-modal relations and has received extensive attention from the research community. From the human perspective, to answer a visual question one reads the question and then refers to the image to generate an answer; the answer is then checked against the question and image again for final confirmation. In this paper, we mimic this process and propose a fully attention-based VQA architecture. Moreover, we propose an answer-checking module that performs unified attention on the joint answer, question and image representation to update the answer, mimicking the human process of checking the answer in context. With answer-checking modules and transferred BERT layers, our model achieves state-of-the-art accuracy of 71.57% with fewer parameters on the VQA-v2.0 test-standard split.
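To make the answer-checking idea concrete, the sketch below shows one unified attention step over the concatenated answer, question and image representations, with the refined answer token read back out. This is an illustrative reconstruction, not the authors' released code: the module name `AnswerCheckingBlock`, the dimensions, and the use of PyTorch's `nn.MultiheadAttention` are all assumptions.

```python
# Hypothetical sketch of an answer-checking module: a single
# self-attention pass over the joint answer/question/image sequence,
# used to update the candidate-answer embedding in context.
# Names and shapes are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn

class AnswerCheckingBlock(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, answer, question, image):
        # answer:   (B, 1, D)  candidate-answer embedding
        # question: (B, Tq, D) question token features (e.g. from BERT)
        # image:    (B, Tv, D) image region features
        context = torch.cat([answer, question, image], dim=1)
        # Unified attention: every token attends over the joint sequence,
        # so the answer token is re-examined against question and image.
        out, _ = self.attn(context, context, context)
        context = self.norm(context + out)
        # Return the refined answer token for the final prediction head.
        return context[:, 0]

# Example shapes
block = AnswerCheckingBlock()
ans = torch.randn(2, 1, 768)    # one answer embedding per example
q = torch.randn(2, 14, 768)     # 14 question tokens
v = torch.randn(2, 36, 768)     # 36 image regions
refined = block(ans, q, v)      # (2, 768)
```

In a full model, the refined answer embedding would be fed to a classifier and the block could be stacked or iterated; this sketch only illustrates the single unified-attention update described in the abstract.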



research · 11/04/2020
An Improved Attention for Visual Question Answering
We consider the problem of Visual Question Answering (VQA). Given an ima...

research · 07/07/2021
MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering
Medical Visual Question Answering (VQA) is a multi-modal challenging tas...

research · 04/26/2023
A Symmetric Dual Encoding Dense Retrieval Framework for Knowledge-Intensive Visual Question Answering
Knowledge-Intensive Visual Question Answering (KI-VQA) refers to answeri...

research · 11/12/2018
Holistic Multi-modal Memory Network for Movie Question Answering
Answering questions according to multi-modal context is a challenging pr...

research · 09/21/2020
Regularizing Attention Networks for Anomaly Detection in Visual Question Answering
For stability and reliability of real-world applications, the robustness...

research · 03/27/2023
Curriculum Learning for Compositional Visual Reasoning
Visual Question Answering (VQA) is a complex task requiring large datase...

research · 01/11/2022
On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering
In recent years, multi-modal transformers have shown significant progres...
