Improved Fusion of Visual and Language Representations by Dense Symmetric Co-Attention for Visual Question Answering

04/03/2018
by Duy-Kien Nguyen et al.

A key to solving visual question answering (VQA) lies in how to fuse the visual and language features extracted from an input image and question. We show that an attention mechanism enabling dense, bi-directional interactions between the two modalities boosts the accuracy of answer prediction. Specifically, we present a simple architecture that is fully symmetric between the visual and language representations, in which each question word attends to image regions and each image region attends to question words. It can be stacked to form a hierarchy that models multi-step interactions between an image-question pair. We show through experiments that the proposed architecture achieves a new state of the art on VQA and VQA 2.0 despite its small size. We also present a qualitative evaluation demonstrating how the proposed attention mechanism generates reasonable attention maps over images and questions, leading to correct answer prediction.
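As a rough illustration of the mechanism described above, the following PyTorch sketch implements a single dense, symmetric co-attention layer. It is not the authors' implementation: the class name DenseCoAttention, the single-head bilinear affinity, and the concatenation-plus-residual fusion are simplifying assumptions made here for brevity, and the published architecture differs in its details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseCoAttention(nn.Module):
    """Illustrative sketch of one dense, symmetric co-attention layer.

    v: image-region features, shape (batch, n_regions, dim)
    q: question-word features, shape (batch, n_words, dim)
    """

    def __init__(self, dim):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)  # bilinear weight W
        self.fuse_v = nn.Linear(2 * dim, dim)
        self.fuse_q = nn.Linear(2 * dim, dim)

    def forward(self, v, q):
        # Affinity matrix a[b, i, j] = v_i^T W q_j, shape (batch, n_regions, n_words).
        a = torch.bmm(self.affinity(v), q.transpose(1, 2))

        # Each image region attends to question words (softmax over words),
        # and each question word attends to image regions (softmax over regions).
        attn_v2q = F.softmax(a, dim=2)                   # (batch, n_regions, n_words)
        attn_q2v = F.softmax(a, dim=1).transpose(1, 2)   # (batch, n_words, n_regions)

        q_for_v = torch.bmm(attn_v2q, q)  # question context per image region
        v_for_q = torch.bmm(attn_q2v, v)  # image context per question word

        # Fuse the attended context with the original features via a residual
        # connection, keeping the two modalities symmetric.
        v_out = v + torch.relu(self.fuse_v(torch.cat([v, q_for_v], dim=-1)))
        q_out = q + torch.relu(self.fuse_q(torch.cat([q, v_for_q], dim=-1)))
        return v_out, q_out

# Example usage with hypothetical feature sizes:
layer = DenseCoAttention(dim=512)
v = torch.randn(8, 36, 512)   # e.g. 36 region features per image
q = torch.randn(8, 14, 512)   # e.g. 14 word features per question
v, q = layer(v, q)
```

Stacking several such layers, each consuming the previous layer's outputs, gives the multi-step hierarchy of image-question interactions described above.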

