Movie Question Answering: Remembering the Textual Cues for Layered Visual Contents

04/25/2018
by Bo Wang, et al.

Movies provide us with a wealth of visual content as well as engaging stories. Existing methods have shown that understanding movie stories through visual content alone remains a hard problem. In this paper, for answering questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content with a Static Word Memory module and a Dynamic Subtitle Memory module, respectively. In particular, we first extract words and sentences from the training movie subtitles. The hierarchically formed movie representations learned by LMN then encode not only the correspondence between words and the visual content inside frames, but also the temporal alignment between sentences and frames inside movie clips. We also extend LMN into three variant frameworks to illustrate its extensibility. We conduct extensive experiments on the MovieQA dataset. With only visual content as input, LMN with the frame-level representation obtains a large performance improvement. When subtitles are incorporated into LMN to form the clip-level representation, we achieve state-of-the-art performance on the online 'Video+Subtitles' evaluation task. These results demonstrate that the proposed LMN framework is effective and that the hierarchically formed movie representations hold good potential for movie question answering applications.
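As a rough illustration of the layered design described above, the sketch below shows how a frame-level word memory and a clip-level subtitle memory could be wired together: frame features attend over subtitle word embeddings, and the resulting frame-level representation is then aligned with subtitle sentence embeddings to form a clip-level vector. This is only a minimal PyTorch sketch under assumed interfaces; the class names (StaticWordMemory, DynamicSubtitleMemory), feature dimensions, dot-product attention, and mean pooling used here are assumptions for illustration, not the authors' exact implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class StaticWordMemory(nn.Module):
    """Frame level: attend over subtitle word embeddings for each frame feature."""
    def __init__(self, frame_dim, word_dim, hidden_dim):
        super().__init__()
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        self.word_proj = nn.Linear(word_dim, hidden_dim)

    def forward(self, frames, words):
        # frames: (num_frames, frame_dim); words: (num_words, word_dim)
        q = self.frame_proj(frames)            # (F, H) projected frame features
        k = self.word_proj(words)              # (W, H) projected word embeddings
        attn = F.softmax(q @ k.t(), dim=-1)    # per-frame attention over words
        return q + attn @ k                    # word-augmented frame representation


class DynamicSubtitleMemory(nn.Module):
    """Clip level: align subtitle sentences with the frame-level representation."""
    def __init__(self, hidden_dim, sent_dim):
        super().__init__()
        self.sent_proj = nn.Linear(sent_dim, hidden_dim)

    def forward(self, frame_repr, sentences):
        # frame_repr: (F, H) from StaticWordMemory; sentences: (S, sent_dim)
        s = self.sent_proj(sentences)                     # (S, H)
        align = F.softmax(frame_repr @ s.t(), dim=-1)     # frame-to-sentence alignment
        clip_repr = (frame_repr + align @ s).mean(dim=0)  # pooled clip-level vector
        return clip_repr


# Toy usage with random tensors standing in for CNN frame features,
# subtitle word embeddings, and subtitle sentence embeddings.
frames = torch.randn(16, 2048)
words = torch.randn(50, 300)
sentences = torch.randn(8, 300)
frame_repr = StaticWordMemory(2048, 300, 512)(frames, words)
clip_vec = DynamicSubtitleMemory(512, 300)(frame_repr, sentences)
print(clip_vec.shape)  # torch.Size([512])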


Related research

07/20/2017
Video Question Answering via Attribute-Augmented Attention Network Learning
Video Question Answering is a challenging problem in visual information ...

06/24/2019
Adversarial Multimodal Network for Movie Question Answering
Visual question answering by using information from multiple modalities ...

04/18/2019
Progressive Attention Memory Network for Movie Story Question Answering
This paper proposes the progressive attention memory network (PAMN) for ...

07/31/2019
Learning Question-Guided Video Representation for Multi-Turn Video Question Answering
Understanding and conversing about dynamic scenes is one of the key capa...

08/16/2017
mAnI: Movie Amalgamation using Neural Imitation
Cross-modal data retrieval has been the basis of various creative tasks ...

11/09/2017
Enhanced Movie Content Similarity Based on Textual, Auditory and Visual Information
In this paper we examine the ability of low-level multimodal features to...

10/13/2016
Video Fill in the Blank with Merging LSTMs
Given a video and its incomplete textual description with missing words...
