Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering

04/08/2019
by Chenyou Fan, et al.

In this paper, we propose a novel end-to-end trainable Video Question Answering (VideoQA) framework with three major components: 1) a new heterogeneous memory which can effectively learn global context information from appearance and motion features; 2) a redesigned question memory which helps understand the complex semantics of the question and highlights queried subjects; and 3) a new multimodal fusion layer which performs multi-step reasoning by attending to relevant visual and textual hints with self-updated attention. Our VideoQA model first generates global context-aware visual and textual features, respectively, by interacting current inputs with memory contents. It then performs attentional fusion of the multimodal visual and textual representations to infer the correct answer. Multiple cycles of reasoning can be made to iteratively refine the attention weights of the multimodal data and improve the final representation of the QA pair. Experimental results demonstrate that our approach achieves state-of-the-art performance on four VideoQA benchmark datasets.
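The multi-step reasoning described above can be sketched as follows. This is a minimal illustrative simplification, not the paper's exact formulation: at each cycle, attention weights over visual and textual features are conditioned on the current fused state, and the state is then updated from the attended contexts. The projection matrix `W` and the update rule are hypothetical placeholders.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def multi_step_fusion(visual, textual, steps=3, seed=0):
    """Sketch of multi-step attentional fusion: each cycle attends
    over visual and textual features conditioned on the current
    fused state, then refines the state (hypothetical simplification
    of the paper's multimodal fusion layer)."""
    rng = np.random.default_rng(seed)
    d = visual.shape[1]
    W = rng.standard_normal((d, d)) * 0.1  # hypothetical projection
    s = np.zeros(d)                        # fused QA representation
    for _ in range(steps):
        # self-updated attention: scores depend on the current state s
        a_v = softmax(visual @ (W @ s + np.ones(d)))   # visual attention
        a_t = softmax(textual @ (W @ s + np.ones(d)))  # textual attention
        v_ctx = a_v @ visual   # attended visual context
        t_ctx = a_t @ textual  # attended textual context
        s = np.tanh(v_ctx + t_ctx + s)  # refine the fused representation
    return s

# toy example: 5 visual frames, 4 question tokens, feature dim 8
v = np.random.default_rng(1).standard_normal((5, 8))
t = np.random.default_rng(2).standard_normal((4, 8))
out = multi_step_fusion(v, t)
```

Each additional cycle re-weights both modalities using the state produced by the previous cycle, which is what lets the model iteratively sharpen its focus on the queried subjects.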


Related research

07/10/2021
DualVGR: A Dual-Visual Graph Reasoning Unit for Video Question Answering
Video question answering is a challenging task, which requires agents to...

07/10/2019
Learning to Reason with Relational Video Representation for Question Answering
How does machine learn to reason about the content of a video in answeri...

09/21/2018
Multimodal Dual Attention Memory for Video Story Question Answering
We propose a video story question-answering (QA) architecture, Multimoda...

12/06/2021
MoCA: Incorporating Multi-stage Domain Pretraining and Cross-guided Multimodal Attention for Textbook Question Answering
Textbook Question Answering (TQA) is a complex multimodal task to infer ...

10/03/2022
Extending Compositional Attention Networks for Social Reasoning in Videos
We propose a novel deep architecture for the task of reasoning about soc...

12/18/2017
Visual Explanations from Hadamard Product in Multimodal Deep Networks
The visual explanation of learned representation of models helps to unde...

09/11/2018
The Visual QA Devil in the Details: The Impact of Early Fusion and Batch Norm on CLEVR
Visual QA is a pivotal challenge for higher-level reasoning, requiring u...
