DeepStory: Video Story QA by Deep Embedded Memory Networks

07/04/2017
by Kyung-Min Kim, et al.

Question-answering (QA) on video content is a significant challenge for achieving human-level intelligence, as it involves both vision and language in real-world settings. Here we demonstrate the possibility of an AI agent performing video story QA by learning from a large number of cartoon videos. We develop a video-story learning model, Deep Embedded Memory Networks (DEMN), to reconstruct stories from a joint scene-dialogue video stream using a latent embedding space of observed data. The video stories are stored in a long-term memory component. For a given question, an LSTM-based attention model uses the long-term memory to recall the best question-story-answer triplet by focusing on specific words containing key information. We trained the DEMN on a novel QA dataset built from the children's cartoon video series Pororo. The dataset contains 16,066 scene-dialogue pairs from 20.5 hours of video, 27,328 fine-grained sentences for scene description, and 8,913 story-related QA pairs. Our experimental results show that the DEMN outperforms other QA models. This is mainly due to 1) the reconstruction of video stories in a combined scene-dialogue form that utilizes the latent embedding and 2) attention. DEMN also achieved state-of-the-art results on the MovieQA benchmark.
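
The recall step described above (store stories in memory, retrieve the story that best matches a question, then choose an answer) can be illustrated with a minimal sketch. This is not the DEMN itself: a bag-of-words count vector and cosine similarity stand in for the learned latent embedding and the LSTM-based attention, and the function names (`embed`, `cosine`, `answer`) and the toy stories are hypothetical, made up for illustration only.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a learned embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer(question, memory, candidates):
    """Recall the best-matching story, then pick the best candidate answer."""
    q = embed(question)
    # Step 1: recall the stored story most similar to the question.
    story = max(memory, key=lambda s: cosine(q, embed(s)))
    # Step 2: score each candidate against the question plus recalled story.
    context = embed(question + " " + story)
    best = max(candidates, key=lambda c: cosine(context, embed(c)))
    return story, best


# Hypothetical long-term memory of two reconstructed stories (invented data).
memory = [
    "pororo and crong play with a red ball in the snow",
    "loopy bakes cookies for her friends at home",
]
story, best = answer(
    "what does loopy bake",
    memory,
    ["loopy bakes cookies", "loopy plays with a ball"],
)
print(story)
print(best)
```

In this sketch the answer is selected from the question-story pair rather than from the question alone, which loosely mirrors the question-story-answer triplet described in the abstract; a real system would replace both similarity steps with trained models.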

research
09/21/2018

Multimodal Dual Attention Memory for Video Story Question Answering

We propose a video story question-answering (QA) architecture, Multimoda...
research
10/15/2014

Memory Networks

We describe a new class of learning models called memory networks. Memor...
research
07/20/2020

Multimodal Dialogue State Tracking By QA Approach with Data Augmentation

Recently, a more challenging state tracking task, Audio-Video Scene-Awar...
research
12/06/2015

A Restricted Visual Turing Test for Deep Scene and Event Understanding

This paper presents a restricted visual Turing test (VTT) for story-line...
research
05/07/2020

DramaQA: Character-Centered Video Story Understanding with Hierarchical QA

Despite recent progress on computer vision and natural language processi...
research
10/14/2019

TruNet: Short Videos Generation from Long Videos via Story-Preserving Truncation

In this work, we introduce a new problem, named as story-preserving lon...
research
05/19/2020

Matching Questions and Answers in Dialogues from Online Forums

Matching question-answer relations between two turns in conversations is...
