A Layered Memory Network for MovieQA
Movies provide a wealth of visual content as well as engaging stories. Existing methods have shown that understanding movie stories from visual content alone remains a hard problem. In this paper, for answering questions about movies, we put forward a Layered Memory Network (LMN) that represents frame-level and clip-level movie content by the Static Word Memory module and the Dynamic Subtitle Memory module, respectively. Specifically, we first extract words and sentences from the training movie subtitles. The hierarchically formed movie representations learned by LMN then encode not only the correspondence between words and visual content inside frames, but also the temporal alignment between sentences and frames inside movie clips. We also extend the LMN model into three variant frameworks to illustrate its extensibility. We conduct extensive experiments on the MovieQA dataset. With only visual content as input, LMN with the frame-level representation obtains a large performance improvement. When subtitles are incorporated into LMN to form the clip-level representation, we achieve state-of-the-art performance on the online evaluation task of 'Video+Subtitles'. These results demonstrate that the proposed LMN framework is effective and that the hierarchically formed movie representations have good potential for movie question answering.
Bridging visual understanding and human-computer interaction is a challenging task in artificial intelligence. Though visual captioning [Li et al. 2017, Yang, Han, and Wang 2017, Liu, Li, and Shi 2017, Pan et al. 2016, Wang et al. 2012] has been shown to be promising in connecting visual content to natural language, it usually narrates the coarse semantics of visual content and lacks the ability to model different correlations among visual cues. Visual question answering (VQA) [Malinowski, Rohrbach, and Fritz 2015, Zhu et al. 2016, Xiong, Merity, and Socher 2016], in contrast, relies on holistic scene understanding to find correct answers at different levels of visual understanding. Popular VQA methods aim to learn the co-occurrence of a particular combination of features extracted from images and questions, e.g., a common embedding space for images and words via Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) [Malinowski, Rohrbach, and Fritz 2015]. To accurately associate specific language elements with particular visual content, attention [Zhu et al. 2016] or dynamic memory [Xiong, Merity, and Socher 2016] mechanisms were proposed to improve the performance of VQA.
As videos can be regarded as a spatio-temporal extension of images, video understanding requires a representation that encodes both the visual content of each frame and the temporal dependencies among successive frames. Unlike other videos, most movies have a specific background (e.g., action film or war film) as well as a particular shooting style (e.g., flashback). Thus understanding the story of a movie from visual content alone is a challenging task. On the other hand, a movie usually comes with a standard subtitle that consists of the dialogues between actors. This offers a way to better understand the story of a movie and thus facilitates automatic movie question answering.
In this paper, we explore how to utilize movie clips and subtitles for movie question answering. We propose a Layered Memory Network (LMN) to learn a layered representation of movie content, i.e., frame-level and clip-level layers, by a Static Word Memory module and a Dynamic Subtitle Memory module, respectively. The Static Word Memory contains all the word information of the MovieQA dataset [Tapaswi et al. 2016], while the Dynamic Subtitle Memory contains all the subtitle information. The framework of our proposed method is illustrated in Fig. 1. First, we obtain the frame-level representation by representing the regional features of each movie frame with the Static Word Memory module. Second, the generated frame-level representation is fed into the Dynamic Subtitle Memory module to obtain the final clip-level representation. Thus, the hierarchically formed movie representations, i.e., the layered representation learned by LMN at the frame and clip levels, encode not only the correspondence between words and visual content inside frames, but also the temporal alignment between sentences and frames inside movie clips.
The main contributions of this paper are summarized as follows. (1) We propose a Layered Memory Network which can utilize visual and text information. The model can represent movie content with more semantic information at both the frame level and the clip level. (2) We propose three extended frameworks based on our LMN, which can remove irrelevant information from massive external texts and can improve the reasoning ability of the LMN model. (3) The LMN method shows good performance on movie question answering on the MovieQA dataset. We obtain state-of-the-art performance on the online evaluation task (http://movieqa.cs.toronto.edu/leaderboard/#table-movie) of ‘Video+Subtitles’ for movie question answering.
Methods for image question answering are mainly categorized into joint embedding [Ren, Kiros, and Zemel 2015, Ma, Lu, and Li 2016], attention mechanisms [Zhu et al. 2016, Yang et al. 2016, Shih, Singh, and Hoiem 2016, Xu and Saenko 2016], and incorporating external knowledge [Nie et al. 2013, Wang et al. 2016, Wu et al. 2016, Zhu, Lim, and Fei-Fei 2017]. Joint embedding methods aim to learn the co-occurrence of a particular combination of features extracted from images and questions, whereas attention mechanisms are utilized to accurately associate specific language elements with particular visual content. As knowledge-based reasoning is effective in traditional question answering systems, experiments have shown that appropriately utilizing external knowledge can improve the performance of image question answering, for example by storing vectorized facts in a dynamic memory network [Xiong, Merity, and Socher 2016] or by connecting CNN-extracted visual concepts to nodes of a knowledge graph [Wang et al. 2017].
As videos can be regarded as spatio-temporal extensions of images, how to incorporate temporal cues into the video representation and associate it with certain textual cues in questions is crucial to video question answering [Zhu et al. 2017, Zeng et al. 2017]. As videos are more complex than images, constructing datasets to boost research on video question answering is a challenging task. Examples include TGIF-QA [Jang et al. 2017], MarioQA [Mun et al. 2017], the ‘fill-in-the-blank’ dataset [Zhu et al. 2017], and the large-scale video question answering dataset without manual annotations [Zeng et al. 2017]. Recently, Tapaswi et al. [Tapaswi et al. 2016] proposed the MovieQA dataset with multiple sources for movie question answering, which attracted interesting work such as video-story learning [Kim et al. 2017] and multi-modal movie question answering [Na et al. 2017]. Though the Deep Embedded Memory Networks (DEMN) [Kim et al. 2017] can reconstruct stories from a joint scene-dialogue video stream, the model is not trained in an end-to-end way. The Read-Write Memory Network (RWMN) [Na et al. 2017] utilizes multi-layered CNNs to capture and store the sequential information of movies in the memory, so as to help answer movie questions. As a specific application of video question answering, movie question answering requires both accurate visual information and high-level semantics to infer the answer. Subtitles accompanying movie clips may implicitly narrate story cues that are temporally aligned with sequential frames, or even convey relationships among roles and objects.
In this section, we introduce the proposed Layered Memory Network (LMN), which utilizes both movie content and subtitles to answer movie questions. The input of the LMN is a sequence of frame-wise feature maps and a question. The output is the answer predicted by the LMN. In the following subsections, we first introduce how the LMN represents the visual content at both the frame level and the clip level, in word space and sentence space, respectively. As the proposed LMN is a basic framework for movie question answering, we then propose three variants to demonstrate the extensibility of our framework.
The main objective of this module is to obtain a semantic representation of specific regions in the movie frames through the Static Word Memory. The framework is shown in Fig. 2. Suppose we have a Static Word Memory $W_e \in \mathbb{R}^{|V| \times d}$, which can be seen as a word embedding matrix with vocabulary size $|V|$ that maps each word into a continuous $d$-dimensional vector. The Static Word Memory can be learned by the skip-gram model [Mikolov et al. 2013]. There is a sequence of frame-wise feature maps $\{I_1, \ldots, I_T\}$, which are extracted from a convolutional layer of a CNN, each of shape $C \times H \times W$, where $C$, $H$, and $W$ denote the channel, height, and width of the feature maps, respectively. Thus we can obtain regional features $\{r_{ij}\}$ with $r_{ij} \in \mathbb{R}^{C}$, where $i = 1, \ldots, T$ indexes frames and $j = 1, \ldots, H \times W$ indexes regions. Different from the joint embedding methods, which directly map the regional features into a common space, we represent the regional features with the Static Word Memory. We first map the regional features into the word space with a projection $W_r \in \mathbb{R}^{C \times d}$ by $v_{ij} = W_r^{\top} r_{ij}$, where $v_{ij}$ can be seen as the $j$-th projected regional feature of the $i$-th movie frame. Then we utilize an inner product to compute the similarity between the projected regional features and the words of the Static Word Memory. The formulation is defined as:

$$\alpha_{ijk} = v_{ij}^{\top} w_k, \tag{1}$$

where $w_k$ denotes the $k$-th row vector of the Static Word Memory $W_e$. Both $v_{ij}$ and $w_k$ are first scaled to have unit norm, thus the inner product $v_{ij}^{\top} w_k$ is equivalent to the cosine similarity. Then the regional feature $v_{ij}$ can be replaced by a weighted sum over all words of the memory:

$$v_{ij} := \sum_{k=1}^{|V|} \alpha_{ijk} w_k, \tag{2}$$

where $|V|$ denotes the size of our Static Word Memory and $:=$ represents the ‘update’ operation. Thus the frame-level representation can be computed by:

$$v_i = \sum_{j=1}^{H \times W} v_{ij}. \tag{3}$$

Because each regional feature is a weighted sum over all the word vectors, $v_i$ can be seen as a semantic representation of the $i$-th movie frame. This procedure is similar to an attention mechanism over the regions of an image. However, our model attends to the memory of words for each regional feature.
The proposed Static Word Memory has two properties: (1) The Static Word Memory can be taken as a word embedding matrix, so we can utilize the memory to map a word into a continuous vector. (2) The Static Word Memory can be utilized to represent the regional features of movie frames.
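The frame-level procedure described above (project each region into word space, score it against every word by cosine similarity, replace it by the similarity-weighted sum of word vectors, then pool the regions of a frame) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names, the array shapes, the use of unit-normalized word rows inside the weighted sum, and sum pooling over regions are assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def static_word_memory(feature_maps, W_r, W_e):
    """Frame-level representations from regional features (illustrative sketch).

    feature_maps: (T, C, H, W) CNN feature maps, one per frame
    W_r:          (C, d) projection into word space (hypothetical name)
    W_e:          (V, d) word embedding matrix acting as the memory
    returns:      (T, d) one vector per frame
    """
    T, C, H, W = feature_maps.shape
    # Each spatial location becomes one C-dimensional regional feature.
    regions = feature_maps.reshape(T, C, H * W).transpose(0, 2, 1)  # (T, N, C)
    v = l2_normalize(regions @ W_r)   # projected regions, unit norm: (T, N, d)
    w = l2_normalize(W_e)             # unit-norm word vectors:       (V, d)
    alpha = v @ w.T                   # cosine similarities:          (T, N, V)
    v_updated = alpha @ w             # weighted sum over all words:  (T, N, d)
    return v_updated.sum(axis=1)      # pool regions (assumed sum):   (T, d)
```

Because the weights are raw cosine similarities rather than a softmax, a region can draw on many words at once, including with negative weight.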
The main objective of this module is to obtain a semantic representation of specific frames in movie clips through the Dynamic Subtitle Memory. As a movie contains not only visual content but also subtitles, we put forward a Dynamic Subtitle Memory module to represent movie clips with movie subtitles. Suppose the movie frames have been represented by the frame-level representations $\{v_1, \ldots, v_T\}$, which are the output of Eq. (3). Then the subtitles $\{s_1, \ldots, s_M\}$, which are first embedded by the Static Word Memory, are utilized as the Dynamic Subtitle Memory. We have a similar representing procedure:

$$\beta_{ij} = v_i^{\top} s_j, \tag{4}$$

$$v_i := \sum_{j=1}^{M} \beta_{ij} s_j, \tag{5}$$

$$v = \sum_{i=1}^{T} v_i, \tag{6}$$

where $\beta_{ij}$ denotes the similarity of the $j$-th subtitle to the $i$-th frame, Eq. (5) states that the $i$-th frame representation is replaced by a weighted sum over all subtitles, and Eq. (6) obtains the clip-level representation $v$ by summing over the frame-level representations $v_i$. For the task of movie question answering with video as the only input, the frame-level representation $v_i$ in Eq. (6) is the output of Eq. (3). As a result, our clip-level representation is transformed from word space into sentence space and thus carries richer semantic information. Then we use the method from [Tapaswi et al. 2016] to answer the movie questions with open-ended answers as follows:

$$a = \mathrm{softmax}\big((v + u)^{\top} g\big), \tag{7}$$

where $u$ denotes the question vector, $g$ stacks the candidate answer vectors, and $g_h$ denotes the $h$-th answer. Both the question and the answers are embedded by the Static Word Memory $W_e$. Note that the Static Word Memory is shared for representing the regional features of movie frames and the words of sentences. However, the Subtitle Memory is different for each movie, as different movies have their own subtitles. Thus our Word Memory is static but our Subtitle Memory is dynamic. Both the Static Word Memory and the Dynamic Subtitle Memory discussed above have only one memory layer.
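The clip-level step and the answer selection can be sketched in the same way. The code below is a hedged illustration, not the authors' code: frames and subtitles are assumed to be already embedded as `d`-dimensional vectors, and the scoring rule (a softmax over the inner products between each candidate answer and the sum of the clip and question vectors) is an assumption modeled on the memory-network answering scheme of [Tapaswi et al. 2016]. All function names are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_subtitle_memory(frames, subtitles):
    """Clip-level representation from frame vectors (illustrative sketch).

    frames:    (T, d) frame-level representations
    subtitles: (M, d) embedded subtitle sentences of one movie
    returns:   (d,)   clip-level representation
    """
    beta = frames @ subtitles.T        # frame-to-subtitle similarity: (T, M)
    frames_updated = beta @ subtitles  # each frame as a sum of subtitles: (T, d)
    return frames_updated.sum(axis=0)  # sum over frames

def answer_probabilities(clip, question, answers):
    """Score candidate answers (A, d) against clip + question (assumed rule)."""
    return softmax(answers @ (clip + question))

# Tiny usage example with hand-picked vectors.
clip = dynamic_subtitle_memory(np.array([[1.0, 2.0], [3.0, 4.0]]), np.eye(2))
probs = answer_probabilities(np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                             np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]]))
```

With identity subtitles the clip vector is simply the sum of the frame vectors, which makes the sketch easy to check by hand.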
In summary, the Layered Memory Network has the following advantages: (1) Instead of learning joint embedding matrices, we directly replace the regional features and frame-level features by the Static Word Memory and the Dynamic Subtitle Memory, respectively. The layered frame-level and clip-level representations contain richer semantic information and achieve good performance on movie question answering, which will be discussed in detail in our experiments. (2) Benefiting from exploiting accurate words and subtitles to represent each regional and frame-level feature, our method supports accurate reasoning while answering a question. For example, we can find the most relevant subtitle for each movie frame. Some of our results are shown in Fig. 4 and Fig. 5. (3) Our method is a basic framework with good extensibility. We introduce some extended frameworks based on our method in the following section.
As described above, we only use a single Static Word Memory, which means the region representation is obtained by only one mapping process. In this case, the region representation may contain much irrelevant information and lack some key content needed to answer the question. So, following the multiple hops mechanism [Sukhbaatar et al. 2015] in memory networks, we consider using multiple hops in the Static Word Memory to obtain a better frame-level semantic representation. Different from the memory network, which uses the sum of the output and the input of a layer as the input of the next layer, we only use the regions' semantic representation, i.e., the output of the $(n-1)$-th Static Word Memory, as the input of the $n$-th Static Word Memory. So Eq. (1)-(2) can be replaced by:

$$\alpha_{ijk}^{(n)} = \big(v_{ij}^{(n-1)}\big)^{\top} w_k^{(n)}, \tag{8}$$

$$v_{ij}^{(n)} = \sum_{k=1}^{|V|} \alpha_{ijk}^{(n)} w_k^{(n)}, \tag{9}$$

where $v_{ij}^{(0)}$ is the projected regional feature of the $j$-th region of the $i$-th frame and $|V|$ is the size of the memory. The regional representation $v_{ij}^{(n)}$ is the output of the $n$-th Static Word Memory, $w_k^{(n)}$ is the $k$-th row of the $n$-th Static Word Memory, and $\alpha_{ijk}^{(n)}$ denotes the cosine similarity between $v_{ij}^{(n-1)}$ and $w_k^{(n)}$. Note that the Static Word Memories used for the multiple hops are the same, $W_e^{(1)} = W_e^{(2)} = \cdots = W_e^{(n)}$. The Static Word Memory consists of the words in all movie subtitles. Thus ‘Static’ has two meanings here: (1) The Word Memory is shared by all movies in the MovieQA dataset. (2) The Word Memory remains the same during the multiple hops in the Static Word Memory.
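The multiple hops variant amounts to applying the same word memory repeatedly, feeding the output of one hop into the next without adding the hop input back in. A minimal NumPy sketch, assuming the regional features are already projected into word space and re-normalized before each hop so that the scores stay cosine similarities (the function names and the `hops` parameter are hypothetical):

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale vectors along the last axis to (approximately) unit norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def multi_hop_word_memory(v, W_e, hops=2):
    """Apply the same word memory several times (illustrative sketch).

    v:    (N, d) projected regional features of one frame
    W_e:  (V, d) word embedding matrix, shared by every hop
    hops: number of memory layers
    """
    w = l2_normalize(W_e)
    for _ in range(hops):
        alpha = l2_normalize(v) @ w.T  # cosine similarity to every word
        v = alpha @ w                  # hop output becomes the next hop input
    return v
```

Since no parameters are introduced per hop, the number of hops only changes the amount of computation, not the model size.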
The subtitle contains the complete dialogues of the movie. However, the movie clips only cover certain segments of the whole movie. So even if we utilize the weighted sum over the original subtitles to form the clip-level representation, the clip-level representation still contains a lot of irrelevant information. To solve this problem, we make two improvements: (1) We use the multiple hops mechanism [Sukhbaatar et al. 2015] in the Dynamic Subtitle Memory to enhance the reasoning ability of the module. (2) We update the Dynamic Subtitle Memory to remove the irrelevant information. From Eq. (6), we obtain the clip-level semantic representation produced by the Dynamic Subtitle Memory. Then we use this clip-level representation to update the Dynamic Subtitle Memory. The framework is shown in Fig. 3. The update procedures are computed as follows:
Here the clip-level representation is obtained by summing the output of the current Dynamic Subtitle Memory, the memory entries are the sentences of that Dynamic Subtitle Memory, and the update weight denotes the relationship between the clip-level representation and each sentence.
Note that, the ‘Dynamic’ here has two meanings: (1) As described in the basic LMN, the Subtitle Memory is different for each movie. (2) In the Dynamic Subtitle Memory module with update mechanism, the Subtitle Memory will be updated by the clip-level representation.
We want to use the clip-level representation, which is the output of the Dynamic Subtitle Memory module, to answer questions. But the question information is not used in the Dynamic Subtitle Memory. So the representation contains some information that describes the clip-level movie content but is not relevant to the question. Different from the SMem-VQA [Xu and Saenko 2016] module, which utilizes questions to attend to relevant regions, we propose a question-guided model to attend to the subtitles. We first utilize the questions to update the subtitles. Then the question-guided subtitles are applied to represent the movie clips. The question-guided subtitles can be computed by:
Both the subtitles and the question are embedded by the Static Word Memory, followed by mean-pooling over all words of the sentence to obtain the final representation. Thus we obtain question-sensitive subtitles according to the similarity between each subtitle and the question representation.
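One way to read the question-guided step is sketched below. The sentence embedding by mean-pooling word vectors follows the description above; the specific weighting rule, scaling each subtitle vector by its cosine similarity to the question, is an assumption chosen for illustration, since other choices (for example a softmax over the similarities) would fit the description equally well. Function names are hypothetical.

```python
import numpy as np

def embed_sentence(token_ids, W_e):
    """Mean-pool the word vectors of a sentence given its token indices."""
    return W_e[token_ids].mean(axis=0)

def question_guided_subtitles(subtitles, question, eps=1e-8):
    """Reweight subtitle vectors by similarity to the question (assumed rule).

    subtitles: (M, d) embedded subtitle sentences
    question:  (d,)   embedded question
    returns:   (M, d) reweighted subtitles and (M,) similarity weights
    """
    norms = np.linalg.norm(subtitles, axis=1) * np.linalg.norm(question)
    sims = (subtitles @ question) / (norms + eps)  # cosine similarity
    return sims[:, None] * subtitles, sims

# Tiny usage example with a toy embedding matrix.
W_e = np.arange(12, dtype=float).reshape(4, 3)
sentence_vec = embed_sentence([0, 2], W_e)
guided, sims = question_guided_subtitles(
    np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]), np.array([2.0, 0.0]))
```

Under this rule a subtitle orthogonal to the question is zeroed out, so it no longer contributes when the reweighted subtitles are used as the Dynamic Subtitle Memory.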
Besides the three extended frameworks described above, we combine the update mechanism and the question-guided model. In detail, after we obtain the final Dynamic Subtitle Memory updated by Eq. (11), we use the question-guided model to update it again. Note that none of the extended frameworks adds any learnable parameters, so they remain efficient.
We evaluate the Layered Memory Network on the MovieQA dataset [Tapaswi et al. 2016], which contains multiple sources of information such as video clips, plots, subtitles, and scripts. This dataset consists of 14,944 multiple-choice questions about 408 movies. Each question has five open-ended choices but only one of them is correct. All movies have subtitles but only 140 movies have video clips. We focus on the task of ‘Video+Subtitles’. The 6,462 question-answer pairs are split into 4,318, 886, and 1,258 for the training, validation, and test sets, respectively. Also, the 140 movies (6,771 clips in total) are split into 4,385, 1,098, and 1,288 clips for the training, validation, and test sets, respectively. Note that one question may be relevant to several movie clips. As the test set can only be evaluated once per 72 hours on an online evaluation server, we follow [Tapaswi et al. 2016] and also split the training set (4,318 question-answer pairs) into 90% train / 10% development (dev.). As the answer type is multiple choice, performance is measured by accuracy.
For video frame feature extraction, we first extract frames from all movie clips with the rate of 1 frame per second. Following [Tapaswi et al.2016], we also sample frames from all clips but only obtain movie frames with equal space. In our experiments, the extracted movie frames are first resized into . Then we extract features of ‘pool5’ and ‘inception_5b/output’ layers from VGG-16 [Simonyan and Zisserman2015] and GoogLeNet [Szegedy et al.2015], respectively. The ‘pool5’ layer has the shape of while the ‘inception_5b/output’ layer has the shape of .
For the Static Word Memory, we utilize the word2vec model provided by [Tapaswi et al. 2016], which is trained with the skip-gram model [Mikolov et al. 2013] on about 1200 movie plots. The word2vec model has an embedding dimension of 300. For fair comparison and following [Tapaswi et al. 2016], we also keep our Static Word Memory fixed while training the models. For the Dynamic Subtitle Memory, we utilize all the sentences in each movie subtitle.
For training our LMN model, all the model parameters are optimized by minimizing the cross-entropy loss using stochastic gradient descent. The batch size is set to 8 and the learning rate is set to 0.01. We perform early stopping on the dev set (10% of the training set).
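The training recipe above (cross-entropy loss, stochastic gradient descent with learning rate 0.01, early stopping on the dev split) can be illustrated with a few small helpers. This is a sketch under stated assumptions, not the authors' training code: the function names and the `patience` parameter are hypothetical, and the source does not say how many epochs without improvement trigger the stop.

```python
import numpy as np

def cross_entropy(logits, label):
    """Negative log-probability of the correct answer under a softmax."""
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[label]

def sgd_step(param, grad, lr=0.01):
    """One plain stochastic gradient descent update."""
    return param - lr * grad

def best_epoch(dev_accuracies, patience=3):
    """Index of the epoch that early stopping would keep.

    Training stops once dev accuracy has failed to improve for
    `patience` consecutive epochs (hypothetical criterion).
    """
    best, best_idx, waited = -np.inf, -1, 0
    for idx, acc in enumerate(dev_accuracies):
        if acc > best:
            best, best_idx, waited = acc, idx, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx
```

With five candidate answers, uniform scores give a loss of ln 5, which corresponds to the random-guess accuracy of 20%.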
| Method | Video | Subtitles | Video+Subtitles |
| LMN + (G) | 38.3 | 38.1 | 39.3 |
| LMN + (V) | 38.6 | 38.1 | 39.6 |
In this subsection, we evaluate the performance of our proposed Layered Memory Network (LMN). We compare LMN with two baseline models. SSCB [Tapaswi et al. 2016] utilizes a neural network to learn a factorization of the question and answer similarity. The MemN2N model [Sukhbaatar et al. 2015] was first proposed for text question answering (QA) and was modified by [Tapaswi et al. 2016] for movie QA. All the results are reported in Table 1, where (G) and (V) denote GoogLeNet and VGG-16 features, respectively.
| Method | Accuracy (%) |
| LMN + (G) | 39.3 |
| LMN + (V) | 39.6 |
| LMN + Multiple hops in SWM + (G) | 39.6 |
| LMN + Multiple hops in SWM + (V) | 39.6 |
| LMN + Multiple hops in SWM + (G) | 40.1 |
| LMN + Multiple hops in SWM + (V) | 39.9 |
| LMN + UM + (G) | 41.4 |
| LMN + UM + (V) | 41.6 |
| LMN + UM + (G) | 41.5 |
| LMN + UM + (V) | 41.4 |
| LMN + QG + (G) | 40.6 |
| LMN + QG + (V) | 40.1 |
| LMN + UM + QG + (G) | 42.3 |
| LMN + UM + QG + (V) | 42.5 |
From Table 1, we first see that understanding movie stories by exploiting only video content is a hard problem, as shown by the near random-guess results of the ‘SSCB’ and ‘MemN2N’ methods. Second, our LMN model with VGG-16 features obtains a large performance gain of 15.5% by exploiting only the video content. Even compared with ‘MemN2N’ taking both video and subtitles as inputs, LMN still outperforms it by 4.4%. Note that here LMN only contains frame-level representations and does not exploit movie subtitles. On the ‘Video+Subtitles’ task, ‘SSCB’ has near random-guess performance, while the performance of ‘MemN2N’ degrades by about 3.8%. LMN obtains a further performance improvement of 1%. We re-implemented the MemN2N model and obtained a competitive accuracy of 37.45%. This illustrates that the performance improvement results from the effectiveness of the LMN model. Finally, when using GoogLeNet features, our method obtains a similar performance improvement. In summary, we conclude that semantic information (e.g., movie subtitles) is important for movie question answering and that LMN performs well on movie story understanding even without subtitles.
All the results are shown in Table 2. From the second block of Table 2, we can see that using different numbers of Static Word Memories yields similar performance. The reasons might be as follows. First, the Static Word Memory has a vocabulary size of 26,630 and an embedding dimension of 300. Even with the multiple hops mechanism, it is difficult to obtain a more precise representation from such a large memory. Second, in the multiple hops process, we obtain the regional representation of each frame without knowing anything about the question, so the resulting representation may have no connection with the question and answers.
The third block of Table 2 shows the results of the update mechanism of the Dynamic Subtitle Memory. We obtain a performance improvement of 2.0% when using two Dynamic Subtitle Memories. This illustrates that, through the update mechanism, the model does remove some irrelevant information and reaches a better understanding ability.
From the fourth block of Table 2, we observe that LMN with the question-guided extension obtains a performance improvement of 1.0% when taking VGG-16 features as inputs. Besides these extended frameworks, we also combine the update mechanism and the question-guided model. The results are listed in the fifth block of Table 2. We obtain a performance improvement of 0.9% over using only the update mechanism. Thus the question-guided model makes the Dynamic Subtitle Memory more relevant to the questions, and LMN has good extensibility.
We first show examples from LMN with only the frame-level representation, i.e., with no subtitles incorporated. From the examples in Fig. 4 we can see that, though there are no direct connections between the words in questions and answers, LMN successfully hits the correct answers. These examples illustrate that LMN can accurately associate specific language elements with particular video content.
In Fig. 5, we show examples from the LMN model with subtitles incorporated. From the pairs of examples inside the red bounding boxes we can see that, given the same set of candidate subtitles within one movie, LMN successfully infers the relevant relationships between subtitles and question-answer pairs. For example, ‘Sherry’ and ‘Red ink on pink paper’ are ranked higher for the answer ‘A pink letter from Sherry’ than for the answer ‘She is an animal communicator’, i.e., with rank indices of ‘28’ and ‘63’ vs. ‘78’ and ‘82’. The examples in the blue bounding box also illustrate the effectiveness of our method.
| Method | Accuracy (%) |
| LSTM + CNN | 23.45 |
| LSTM + Discriminative CNN | 24.32 |
| DEMN [Kim et al. 2017] | 29.97 |
| RWMN [Na et al. 2017] | 36.25 |
| LMN + (V) + (Video only) | 34.34 |
| LMN + UM + QG + (V) | 39.03 |
We evaluated the proposed models on the test set using the MovieQA online evaluation server. All the results are shown in Table 3. We also list some of the top-performing results on the ‘Leader Board’ (http://movieqa.cs.toronto.edu/leaderboard/#table-movie). From Table 3, we can see that LMN with only the frame-level representation obtains competitive performance. In particular, LMN with the update mechanism and the question-guided model outperforms the other methods by about 2.78% on the ‘Video+Subtitles’ task. Moreover, LMN with the update mechanism and the question-guided model ranked first (as of Sept. 10, 2017).
In this paper, we propose a Layered Memory Network (LMN) for movie question answering. LMN learns a layered representation of movie content, which encodes not only the correspondence between words and visual content inside frames but also the temporal alignment between sentences and frames inside movie clips. We also extend the LMN model into three extended frameworks. Experimental results on the MovieQA dataset show the effectiveness and better performance of our method. We also illustrate its effectiveness with movie question answering examples. In addition, on the online evaluation server, LMN with the update mechanism and the question-guided model together ranked first on the ‘Video+Subtitles’ task.
This work is supported by the NSFC (under Grant U1509206, 61722204, 61472116, 61472276).
Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the IEEE International Conference on Computer Vision, 1–9.
International Conference on Machine Learning, 2397–2406.