Modality Shifting Attention Network for Multi-modal Video Question Answering

07/04/2020
by   Junyeong Kim, et al.
0

This paper considers a network referred to as Modality Shifting Attention Network (MSAN) for Multimodal Video Question Answering (MVQA) task. MSAN decomposes the task into two sub-tasks: (1) localization of temporal moment relevant to the question, and (2) accurate prediction of the answer based on the localized moment. The modality required for temporal localization may be different from that for answer prediction, and this ability to shift modality is essential for performing the task. To this end, MSAN is based on (1) the moment proposal network (MPN) that attempts to locate the most appropriate temporal moment from each of the modalities, and also on (2) the heterogeneous reasoning network (HRN) that predicts the answer using an attention mechanism on both modalities. MSAN is able to place importance weight on the two modalities for each sub-task using a component referred to as Modality Importance Modulation (MIM). Experimental results show that MSAN outperforms previous state-of-the-art by achieving 71.13% test accuracy on TVQA benchmark dataset. Extensive ablation studies and qualitative analysis are conducted to validate various components of the network.

READ FULL TEXT

page 1

page 7

page 8

research
07/20/2017

Video Question Answering via Attribute-Augmented Attention Network Learning

Video Question Answering is a challenging problem in visual information ...
research
05/28/2019

Gaining Extra Supervision via Multi-task learning for Multi-Modal Video Question Answering

This paper proposes a method to gain extra supervision via multi-task le...
research
12/14/2021

Bilateral Cross-Modality Graph Matching Attention for Feature Fusion in Visual Question Answering

Answering semantically-complicated questions according to an image is ch...
research
04/18/2019

Progressive Attention Memory Network for Movie Story Question Answering

This paper proposes the progressive attention memory network (PAMN) for ...
research
03/08/2021

Content-Based Detection of Temporal Metadata Manipulation

Most pictures shared online are accompanied by a temporal context (i.e.,...
research
10/03/2022

Extending Compositional Attention Networks for Social Reasoning in Videos

We propose a novel deep architecture for the task of reasoning about soc...
research
10/05/2022

Locate before Answering: Answer Guided Question Localization for Video Question Answering

Video question answering (VideoQA) is an essential task in vision-langua...

Please sign up or login with your details

Forgot password? Click here to reset