Multimodal Dual Attention Memory for Video Story Question Answering

09/21/2018
by Kyung-Min Kim, et al.

We propose Multimodal Dual Attention Memory (MDAM), an architecture for video story question answering (QA). The key idea is a dual attention mechanism with late fusion. MDAM first uses self-attention to learn the latent concepts in scene frames and captions. Given a question, it then applies a second attention over these latent concepts. Multimodal fusion is performed only after both attention processes (late fusion). With this pipeline, MDAM learns to infer a high-level vision-language joint representation from an abstraction of the full video content. We evaluate MDAM on the PororoQA and MovieQA datasets, which have large-scale QA annotations on cartoon videos and movies, respectively. On both datasets, MDAM achieves new state-of-the-art results by significant margins over the runner-up models. Ablation studies confirm that the dual attention mechanism combined with late fusion performs best. We also perform a qualitative analysis by visualizing the inference mechanisms of MDAM.
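The pipeline described above (self-attention within each modality, a question-guided second attention, then late fusion) can be illustrated with a small sketch. This is not the authors' implementation: the single-head scaled dot-product attention, the element-wise product used for fusion, the toy vectors, and all function names (`self_attention`, `question_attention`, `late_fusion`, `mdam_like_forward`) are assumptions chosen only to show the order of operations.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def self_attention(items):
    """First attention: each item attends over all items of its own modality."""
    scale = math.sqrt(len(items[0]))
    concepts = []
    for query in items:
        weights = softmax([dot(query, key) / scale for key in items])
        concepts.append(weighted_sum(weights, items))
    return concepts


def question_attention(question, concepts):
    """Second attention: the question attends over the latent concepts."""
    scale = math.sqrt(len(question))
    weights = softmax([dot(question, c) / scale for c in concepts])
    return weighted_sum(weights, concepts), weights


def late_fusion(frame_summary, caption_summary):
    """Fuse modalities only after both attention steps.

    Element-wise product is an assumed stand-in for the fusion operator.
    """
    return [f * c for f, c in zip(frame_summary, caption_summary)]


def mdam_like_forward(frames, captions, question):
    frame_concepts = self_attention(frames)        # attention 1, vision
    caption_concepts = self_attention(captions)    # attention 1, language
    frame_summary, frame_w = question_attention(question, frame_concepts)
    caption_summary, caption_w = question_attention(question, caption_concepts)
    fused = late_fusion(frame_summary, caption_summary)
    return fused, frame_w, caption_w


# Toy inputs (hypothetical 3-dimensional embeddings).
frames = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
captions = [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5]]
question = [1.0, 0.0, 0.0]

fused, frame_w, caption_w = mdam_like_forward(frames, captions, question)
```

In this toy setup the question vector is aligned with the first frame and the first caption, so the second attention places its largest weight on those items. The two modalities never interact until `late_fusion`, which is the property the late-fusion design is meant to preserve.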
