Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

12/14/2021
by JianJian Cao, et al.

Answering semantically complicated questions about an image is challenging in the Visual Question Answering (VQA) task. Although the image can be well represented by deep learning, the question is often simply embedded, so its meaning is not well captured. Moreover, the visual and textual features of the two modalities have a representational gap, which makes it difficult to align and exploit cross-modality information. In this paper, we focus on these two problems and propose a Graph Matching Attention (GMA) network. First, it builds a graph not only for the image but also for the question, using both syntactic and embedding information. Next, we explore the intra-modality relationships with a dual-stage graph encoder and then present a bilateral cross-modality graph matching attention to infer the relationships between the image and the question. The updated cross-modality features are then sent to the answer prediction module to predict the final answer. Experiments demonstrate that our network achieves state-of-the-art performance on the GQA dataset and the VQA 2.0 dataset. Ablation studies verify the effectiveness of each module in our GMA network.
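To make the bilateral matching step concrete, here is a minimal sketch of one plausible reading of cross-modality graph matching attention, assuming PyTorch. The class and variable names, feature size, and node counts are illustrative assumptions, not the authors' implementation; graph edges, the dual-stage encoder, and the answer prediction head are omitted:

    # Hypothetical sketch of bilateral cross-modality graph matching attention.
    # All names and sizes (GraphMatchingAttention, d = 512, 36 regions, 14 words)
    # are illustrative assumptions; the paper's actual formulation may differ.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphMatchingAttention(nn.Module):
        """Updates each node of a source graph by attending over the nodes
        of the other modality's graph (scaled dot-product attention)."""
        def __init__(self, d_model: int):
            super().__init__()
            self.q_proj = nn.Linear(d_model, d_model)
            self.k_proj = nn.Linear(d_model, d_model)
            self.v_proj = nn.Linear(d_model, d_model)
            self.scale = d_model ** -0.5

        def forward(self, x_src: torch.Tensor, x_tgt: torch.Tensor) -> torch.Tensor:
            # x_src: (n_src, d) source-graph nodes; x_tgt: (n_tgt, d) target-graph nodes
            scores = self.q_proj(x_src) @ self.k_proj(x_tgt).T * self.scale
            attn = F.softmax(scores, dim=-1)           # each source node matches target nodes
            return x_src + attn @ self.v_proj(x_tgt)   # residual cross-modal update

    # "Bilateral" means the matching runs in both directions.
    d = 512
    vis2txt = GraphMatchingAttention(d)
    txt2vis = GraphMatchingAttention(d)
    visual = torch.randn(36, d)    # e.g. detected-region node features
    textual = torch.randn(14, d)   # e.g. question-word node features
    visual_updated = vis2txt(visual, textual)   # vision attends to language
    textual_updated = txt2vis(textual, visual)  # language attends to vision

The residual connection in the update is a design choice of this sketch: each modality's node features retain their intra-modality content while cross-modality evidence is injected on top.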
