Dynamic Fusion with Intra- and Inter- Modality Attention Flow for Visual Question Answering

12/13/2018
by Gao Peng, et al.

Learning effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method of dynamically fusing multi-modal features with intra- and inter-modality information flow, which alternately passes dynamic information within and across the visual and language modalities. It robustly captures high-level interactions between the language and vision domains, thus significantly improving visual question answering performance. We also show that the proposed dynamic intra-modality attention flow, conditioned on the other modality, can dynamically modulate the intra-modality attention of the target modality, which is vital for multi-modality feature fusion. Experimental evaluations on the VQA 2.0 dataset show that the proposed method achieves state-of-the-art VQA performance. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method.
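To make the two attention flows concrete, here is a minimal NumPy sketch of the general idea: inter-modality attention lets each modality attend to the other, and intra-modality (self-)attention is then modulated by the cross-modal result. This is an illustrative simplification, not the paper's actual architecture; the function names, the additive conditioning, and the toy dimensions are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    # Standard scaled dot-product attention
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores) @ value

# Toy features: 3 visual regions and 4 question words, each 8-dim
rng = np.random.default_rng(0)
v = rng.normal(size=(3, 8))   # visual region features
q = rng.normal(size=(4, 8))   # language (word) features

# Inter-modality flow: each modality attends to the other
v_inter = attention(v, q, q)  # vision queries language
q_inter = attention(q, v, v)  # language queries vision

# Intra-modality flow conditioned on the other modality:
# self-attention whose queries are modulated by the cross-modal summary
v_intra = attention(v + v_inter, v, v)
q_intra = attention(q + q_inter, q, q)
```

Stacking such blocks and alternating the two flows is what allows information to pass back and forth between the modalities over several rounds of fusion.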


Related research

12/14/2021
Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering
Answering semantically-complicated questions according to an image is ch...

11/04/2020
An Improved Attention for Visual Question Answering
We consider the problem of Visual Question Answering (VQA). Given an ima...

11/26/2021
Neural Collaborative Graph Machines for Table Structure Recognition
Recently, table structure recognition has achieved impressive progress w...

01/25/2022
MGA-VQA: Multi-Granularity Alignment for Visual Question Answering
Learning to answer visual questions is a challenging task since the mult...

04/06/2018
Question Type Guided Attention in Visual Question Answering
Visual Question Answering (VQA) requires integration of feature maps wit...

06/25/2023
Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input
The ability to model intra-modal and inter-modal interactions is fundame...

07/07/2020
3D Shape Reconstruction from Vision and Touch
When a toddler is presented a new toy, their instinctual behaviour is to...
