TrUMAn: Trope Understanding in Movies and Animations

08/10/2021 ∙ by Hung-Ting Su, et al. ∙ 0

Understanding and comprehending video content is crucial for many real-world applications such as search and recommendation systems. While recent progress in deep learning has boosted performance on various tasks using visual cues, deep cognition to reason about intentions, motivation, or causality remains challenging. Existing datasets that aim to examine video reasoning capability focus on visual signals such as actions, objects, and relations, or can be answered by exploiting text bias. Observing this, we propose a novel task, along with a new dataset: Trope Understanding in Movies and Animations (TrUMAn), intended to evaluate and develop learning systems beyond visual signals. Tropes are frequently used storytelling devices for creative works. By coping with the trope understanding task and enabling the deep cognition skills of machines, we are optimistic that data mining applications and algorithms could be taken to the next level. To tackle the challenging TrUMAn dataset, we present a Trope Understanding and Storytelling (TrUSt) model with a new Conceptual Storyteller module, which guides the video encoder by performing video storytelling on a latent space. The generated story embedding is then fed into the trope understanding model to provide further signals. Experimental results demonstrate that state-of-the-art learning systems on existing tasks reach only 12.01% accuracy with raw input signals. Even in the oracle case with human-annotated descriptions, BERT contextual embedding achieves at most 28% accuracy. Our proposed TrUSt boosts the model performance and reaches 13.94% accuracy. We also provide detailed analysis to pave the way for future research. TrUMAn is publicly available at:




1. Introduction

Understanding and comprehending rich information in a video clip is crucial for various applications, including but not limited to information retrieval, recommendation systems, and question answering systems. Recent deep learning progress has boosted the performance on many large-scale benchmarks with shallow visual semantics such as action recognition, video search, or video question answering (Video QA). However, deep cognition skills to perform causal and motivational comprehension remain challenging for modern learning models, as noted by recent research (Bengio, 2019; Chang et al., 2021).

Real-world applications and users might be interested in higher-level concepts beyond shallow visual semantics. For example, audiences of a video clip might be interested in another video with a similar story, plot, or sentiment, such as a bittersweet ending. Nevertheless, such concepts are rarely demonstrated in visual manners like object or action occurrences. Therefore, a recommendation system needs to understand the causality and the motivation behind a bittersweet ending, where the protagonist's achievement comes together with a heavy price paid, instead of tracking shallow semantics such as a trophy or a crown.

Figure 1. Trope Understanding in Movies and Animations (TrUMAn). Trope understanding requires deep cognition skills to comprehend the causality and the motivation beyond visual semantics. The first and second columns show similar visual semantics (watching TV) but convey different stories. On the other hand, the third column is visually different from the second column and is delivered with diverse signals, but is more similar in human cognition.

Many efforts have been devoted to building datasets to evaluate and develop learning systems. Video QA (Zeng et al., 2017; Xu et al., 2017; Yu et al., 2019) and multiple-choice movie question answering (MC-MQA) (Tapaswi et al., 2016; Kim et al., 2017; Lei et al., 2018; Liu et al., 2020), in particular, aim to examine the machine capability of reasoning. However, Video QA datasets focus on visual cues such as actions, objects, or relations, which are too shallow to represent deep cognition skills. MC-MQA datasets, while emphasizing concepts beyond visual cues, such as “why” questions, could easily overfit the corresponding language query, as confirmed by recent research (Winterbottom et al., 2020; B. Jasani and Ramanan, 2019; Yang et al., 2020).

We look for another approach: Tropes, which are storytelling devices for creative works such as movies, animations, or literature. Beyond object and action co-occurrences, they are the tools that art creators use to deliver abstract and complex ideas to the audience without spelling out all the details. For example, Bad Boss means a boss callously mistreats their employees. This trope could be portrayed with a scene where a boss is punching or even killing their subordinates. While this concept is trivial for an educated human, it requires a learning system to comprehend the motivation of the action and the cause of the torture scene. Hence, different from conventional Video QA tasks, tropes involve consciousness, systematic generalization, causality, and motivational inference (Chang et al., 2021). A recent work (Chang et al., 2021) utilizes movie synopses and tropes from the TVTropes database to evaluate the reasoning capability of learning systems. However, movie synopses are written by humans and consequently rarely available in practice. Furthermore, movie synopses usually contain implicit human interpretation instead of raw visual signals. Therefore, it would be easier for a machine to capture the tropes in films using human-written synopses.

We are optimistic that trope understanding capability could bring significant leaps forward in both data mining applications and algorithms, from developing a recommendation system to studying motivational behaviors beyond visual semantics. Therefore, we propose a novel task, along with a new dataset, Trope Understanding in Movies and Animations (TrUMAn), including 2423 videos associated with 132 tropes. TrUMAn takes video and audio signals as input, reflecting the real world where people interact with each other and the environment, instead of human-written synopses. Unlike traditional datasets and approaches, which focus on visual signals, TrUMAn requires deep cognition skills of learning systems. For instance, as shown in Figure 1, similar visual clues (watching a screen) are portrayed in the first and the second columns but deliver different stories. In contrast, the video clip in the third column has the same trope as the second column (Bad Boss), conveying similar abstract concepts, but the visual contents are totally different.

To tackle the novel yet challenging TrUMAn, we propose a new Trope Understanding and Storytelling (TrUSt) model, which jointly understands tropes and performs storytelling on a latent space in a multi-task manner. Specifically, the Conceptual Storyteller generates a story embedding vector, which represents the video description. Next, the generated vector is optimized against a human-written description embedded by a pre-trained text encoder. Finally, the generated story embedding vector is fed to a trope understanding model to determine the output trope. By utilizing the generated story embedding, human-written descriptions are not needed during inference.

Experimental results demonstrate that modern learning systems still struggle to solve the trope understanding task, reaching at most 14% accuracy. State-of-the-art models, including the graph-based L-GCN Video QA model (Huang et al., 2020) and the cross-modal pre-training-based XDC action recognition model (Alwassel et al., 2020), utilize visual semantics to perform well on existing tasks but could not solve the trope understanding task. With the aid of human-written descriptions, the accuracy could be boosted to 28%, indicating that the trope understanding dataset using movie synopses (Chang et al., 2021) might over-estimate machine deep cognition capability. Moreover, we provide a comprehensive analysis to pave a new path for future research. Consequently, we are optimistic that our proposed task and dataset could bring learning systems to the next level.

2. Impact and Potential Extensions

This section discusses the impact and several potential new tasks or applications on the basis of our TrUMAn dataset. Apart from solving our challenging task and dataset, future work might be interested in extending our work to build a new dataset or an application.

Search and Recommendation

Retrieving web content based on queries or watching history is essential for various web applications. A user might seek content that he/she is interested in based on cues beyond shallow textual or visual semantics, for example, when looking for a bittersweet movie. TrUMAn provides a test-bed to examine learning systems’ capability of search and recommendation. For example, future research could utilize our trope annotations and categories to formulate a trope-based video recommendation task, i.e., recommending a video based on another video with the same or a similar trope.

Trope-based Video Description Generation

Summarizing a video or a document with natural language sentences is a crucial task that has been studied for years. Conventional benchmarks such as MSVD (Chen and Dolan, 2011) or MSRVTT (Xu et al., 2016) mostly focused on captioning a video based on action signals. As our TrUMAn presents both videos and corresponding descriptions, future work could leverage these video-description pairs to learn to generate descriptions based on video clips.

Disentangling Motivation behind Actions

Some significantly dissimilar tropes are portrayed with similar actions or events. For example, asshole victim and heroic sacrifice are sharply different but could both be displayed by “someone’s death” in a video clip or a novel. By using these tropes and associated videos in TrUMAn, future studies might explore disentangling deeper cognition, such as motivation, from video representations and develop downstream applications.

3. Related Work

Video QA Datasets

Video QA datasets were widely used to evaluate machine capability of understanding a video. Early Video QA works such as Zeng et al. (2017) and Xu et al. (2017) leveraged existing video datasets with captions and annotated question-answer pairs using a text question generation tool (Heilman and Smith, 2010). The assumption behind these datasets is that a machine needs to understand the video content in order to answer a question. In these datasets, a set of answers (usually around 1,000) was pre-defined and classified into several categories. Most answers in Video QA datasets are an entity (e.g., dog) or an action (e.g., dancing). Therefore, Video QA datasets could be narrowed down, to some extent, to action and object recognition tasks conditioned on a text query.

Yu et al. (2019) proposed a human-annotated Video QA dataset, Activitynet-QA (Anet-QA), and extended the question types to include color, location, and spatial and temporal relations. However, Anet-QA did not incorporate causal and motivational queries and therefore could not examine the machine capability of deep cognition skills.

Figure 2. Word cloud of trope categories. The size of each word is proportional to the frequency of the trope in the newly collected TrUMAn dataset. (See Section 4.1)

Movie Understanding (MU) Datasets

MU datasets, while sharing some properties with Video QA datasets, focused more on deeper reasoning capability (e.g., “why” questions). Most MU datasets were formulated as multiple-choice movie question answering (MC-MQA) with carefully designed distractor options to examine machine reasoning capability. MovieQA (Tapaswi et al., 2016) labeled 15,000 multiple-choice questions associated with 400 movies. While it opened up research on movie understanding, the main drawback of MovieQA is that questions were labeled using plot synopses instead of the movies themselves. TVQA (Lei et al., 2018) collected 6 TV series and annotated 100,000 multiple-choice questions according to the videos. The TVQA dataset mainly focused on temporal relations (i.e., all questions contained “before”, “after”, or “when”). VIOLIN (Liu et al., 2020) proposed a Video-and-Language Inference task where positive-negative statement pairs were provided, and the model was asked to determine which one was correct. While MU datasets provided deeper questions to evaluate machine reasoning capability, recent research (B. Jasani and Ramanan, 2019; Winterbottom et al., 2020; Yang et al., 2020) suggested that models tended to overfit language queries (questions or language inference). Therefore, a learning model might reach a high score by exploiting bias instead of understanding movie contents. Our dataset, in contrast, requires the model to process raw signals to perform the trope understanding task.

Trope Understanding

Tropes were introduced to the multimedia community by Smith et al. (2017). Recently, Chang et al. (2021) proposed a Trope in Movie Synopses (TiMoS) dataset with about 6,000 movie synopses and 95 associated tropes. Different from Video QA datasets, the trope dataset aimed to examine the machine capability of deep cognition, including but not limited to consciousness, systematic generalization, causality, and motivation. As trope detection tasks do not require additional queries, machines cannot capture bias in language queries. However, the inputs of TiMoS are movie synopses instead of movies themselves. On the other hand, our TrUMAn directly inputs movie contents, including video and audio, which fits real-world scenarios such as search or recommendation.

4. TrUMAn Dataset

4.1. Overview

We present a novel dataset TrUMAn (Trope Understanding in Movies and Animations) which includes (1) 2423 videos with audio, (2) 132 tropes along with human-annotated categories. We also include human-annotated video descriptions in our dataset to compare the domain gap between raw visual and audio signals and human-written text. We classify tropes into 8 categories by their properties. As categories are not orthogonal, some tropes belong to 2 or more categories. Figure 2 shows the trope clouds for each category, and the size of each trope reflects the frequency in the dataset.

Character Trait tropes focus on a specific role and its characteristics. The trait is portrayed through behavior instead of a direct description, as in Big Bad, where someone in the video has evil plans and causes all the bad things to happen.

Role Interaction tropes describe the actions, conversations, or encounters of roles in the video. Bad Boss is when a boss is being mean to their employee. However, a bad boss might be a good father or even a hero.

Scene Identification tropes focus on specific views or objects in a scene. This category cannot be solved with object co-occurrences alone. For example, playing with fire, where a character is able to control fire for their own use, cannot be simplified as “someone occurs with fire”.

Situation Understanding tropes depict a short-term scenario in which some events happen. These events could be composed of certain entities, objects, actions, or conversations and convey some information or concepts to the audience. An example is Berserk Button, in which a character flies into a rage over a minor or generally insignificant thing. Detecting this trope requires a model to understand the motivation that triggers someone’s anger.

Story Understanding tropes describe a long-term scenario of the video, usually composed of multiple situations. Recognizing these tropes requires fully understanding what is happening in the video and grasping the meaning of the conversation. For example, in “The Reason You Suck” Speech, a character delivers a speech to another character about why he sucks.

Sentiment tropes deliver emotion through elements of a video, including but not limited to the scene, conversations, speech, or music. Recognizing these tropes requires grasping the emotions that videos convey to the audience. For example, Downer Ending is a movie or TV series that ends in a sad or tragic way; the scenes usually become gloomy and the music is often melancholy.

Audio tropes focus on audio: the music in the video, the speech or conversation content, or the tone of the speakers. These tropes are hard for humans to classify from video appearance alone. For example, for Villain Song, knowing what the character is singing and the tune of the music makes it easier to identify the trope.

Manipulation tropes are tropes where the director uses different photography skills (Running Gag) or the script (Shout Out) to interact with the audience. This type of trope strongly requires deep cognition skills, and knowledge of related concepts may help to recognize them.

4.2. Data Collection

Trope and Video collection

We collected tropes and videos from a Wikipedia-style database, TVTropes. Each trope comes with a definition, several example videos, and related video descriptions, all annotated by web users. Specifically, we query TVTropes for the example videos of each trope and for the descriptions of those videos.

Trope selection

After video collection, we obtain more than 10k videos and about 4k different tropes. Since most tropes have only a few video examples, we select the most frequent tropes and end up with 132 tropes and 2423 video examples, where each trope has more than 10 examples. Finally, we split the 2423 videos into 5 folds (495/487/478/483/480, i.e., 20.43%/20.10%/19.73%/19.93%/19.81%). To compose a 5-fold cross-validation with validation and test sets, we further hold out 12.5% of each training set (10% of the whole data) as the validation set. Thus, each training run has training, validation, and test sets with a size ratio of 7:1:2.
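The split procedure can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' script: the function name `five_fold_splits`, the shuffling seed, and the equal-sized folds are assumptions (the released folds have sizes 495/487/478/483/480). It only shows how holding out 12.5% of each training pool yields roughly a 7:1:2 ratio.

```python
import random


def five_fold_splits(video_ids, n_folds=5, val_fraction=0.125, seed=0):
    """Yield one (train, val, test) triple of ID lists per fold.

    Hypothetical helper: fold assignment here is a seeded shuffle,
    which is an assumption and not the released TrUMAn split.
    """
    ids = list(video_ids)
    random.Random(seed).shuffle(ids)
    # Deal IDs round-robin so fold sizes differ by at most one.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        pool = [v for j, fold in enumerate(folds) if j != k for v in fold]
        # 12.5% of the 80% training pool is 10% of the whole dataset.
        n_val = round(len(pool) * val_fraction)
        val, train = pool[:n_val], pool[n_val:]
        yield train, val, test


splits = list(five_fold_splits(range(2423)))
for train, val, test in splits:
    print(len(train), len(val), len(test))
```

Each test fold is used exactly once, and the validation set is carved out of the remaining four folds, so no video appears in more than one of the three sets within a run.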

4.3. Data Analysis

| Category | Avg. | Median | Min | Max | $\sigma$ | Number |
|---|---|---|---|---|---|---|
| C. Trait | 77.86 | 71.00 | 4.33 | 237.67 | 43.95 | 17 |
| R. Inter. | 69.53 | 57.00 | 4.67 | 157.00 | 42.33 | 10 |
| Scene Id. | 49.60 | 36.67 | 3.67 | 150.00 | 39.24 | 18 |
| Story. U | 55.51 | 45.7 | 3.67 | 237.67 | 38.95 | 37 |
| Situ. U | 57.43 | 45.67 | 3.67 | 233.67 | 42.03 | 51 |
| Sent. | 75.93 | 68.33 | 4.00 | 233.67 | 41.08 | 14 |
| Audio | 79.39 | 72.00 | 4.33 | 178.00 | 46.10 | 22 |
| Mani. | 61.34 | 51.00 | 6.33 | 156.67 | 44.09 | 8 |
| All | 61.90 | 52.00 | 2.33 | 237.67 | 42.71 | 132 |

Table 1. TrUMAn dataset video statistics. It shows the statistics of video length (in seconds) and the number of tropes in each category. The average video length is long and the standard deviation is quite large, showing that videos in the TrUMAn dataset are very diverse. 34% of tropes belong to multiple categories, which shows that tropes can be recognized from different aspects. ($\sigma$ denotes standard deviation.) (See Section 4.3)
| Time | Short (< 20 sec) | Median | Long (> 2 min) |
|---|---|---|---|
| (%) | 17.62 | 68.01 | 14.37 |

Table 2. The statistics of trope videos’ occurrence by length, in percentage. (See Section 4.3)

Table 1 summarizes the video statistics for each trope category. For the whole dataset, the average video length is about 1 minute, but the standard deviation is quite large, showing that videos in the TrUMAn dataset are very diverse. The last column of Table 1 shows the number of tropes. 34% of tropes occur in multiple categories, which shows that tropes can be understood from different aspects. In Table 2, the distribution of video lengths shows that about 14% of videos are long (over 2 minutes) and about 17% are short (less than 20 seconds). The varied length of the videos makes trope detection harder.

| Visual Only | Audio Only | Visual+Audio |
|---|---|---|
| 69.0 | 54.0 | 77.0 |

Table 3. Human and machine evaluation results on a sampled subset of 100 examples. (Section 4.4)

4.4. Human Evaluation on TrUMAn

To better understand the collected dataset and provide directions for future research, we conduct a human evaluation on TrUMAn. We sample 100 video examples for human evaluation, where each human tester is asked to select a trope from 5 options. The 4 distractor candidates are randomly selected from the rest of the tropes, including 2 in the same category as the answer, to make the evaluation more challenging. Note that human annotators in the evaluation are not experts, so human evaluation errors do not indicate an unanswerable example.
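The option construction described above can be illustrated with a short sketch. The helper `build_options` and the toy mapping are hypothetical; the category assignments are the ones mentioned in this paper, and drawing the two remaining distractors from all other tropes (regardless of category) is an assumption about an unspecified detail.

```python
import random


def build_options(answer, trope_to_categories, rng, n_same=2, n_other=2):
    """Return the answer plus four distractors in shuffled order."""
    answer_cats = trope_to_categories[answer]
    others = [t for t in trope_to_categories if t != answer]
    # Candidates that share at least one category with the answer.
    same = [t for t in others if trope_to_categories[t] & answer_cats]
    picked_same = rng.sample(same, n_same)
    # Assumption: the remaining distractors come from any other trope.
    remaining = [t for t in others if t not in picked_same]
    picked_other = rng.sample(remaining, n_other)
    options = [answer] + picked_same + picked_other
    rng.shuffle(options)
    return options


# Toy subset of tropes; a trope may belong to several categories.
TOY_TROPES = {
    "Running Gag": {"Manipulation"},
    "Shout Out": {"Manipulation"},
    "Animation Bump": {"Manipulation"},
    "Overly Long Gag": {"Manipulation"},
    "Bad Boss": {"Role Interaction"},
    "Big Bad": {"Character Trait"},
    "Berserk Button": {"Situation Understanding"},
    "Downer Ending": {"Sentiment"},
    "Villain Song": {"Audio"},
}

options = build_options("Running Gag", TOY_TROPES, random.Random(0))
print(options)
```

With five options, random guessing gives 20% accuracy, which is a useful reference point when reading the human results in Table 3.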

Overall comparison

Table 3 shows non-expert human evaluation results. Overall, visual signals play a relatively more important role than audio signals. Without watching the movie (Audio Only), humans achieve 54.0% accuracy, while Visual Only leads to 69.0% accuracy. Combining both modalities (Visual+Audio) raises human performance to 77.0%, indicating that fusing visual and audio information is crucial for understanding a trope.

Trope modality

Some examples require a specific modality. For example, many audio category tropes centering on music or dialogue can hardly be recognized with video only. On the other hand, several manipulation category tropes, such as animation bump or overly long gag, can rarely be detected without video. Therefore, a single modality is not sufficient to understand all tropes.

Conceivable Tropes

Some tropes seem to require a specific modality. However, humans could conceive the trope with another modality. For example, a video with villain song could be conceived by watching a villain-like character singing. Additionally, some visual tropes such as groin attack could be conceived from the sound of punches and screaming.

Complement Modalities

Several examples require both audio and video to understand, such as Screw this, I’m outta here!. With only video available, the human tester mistook the clip for Big NO!, as the video shows a shot of a person who seems to be screaming. In contrast, with only audio, another human tester labeled it as Feud Episode because it sounds like two people are feuding with each other and breaking up. With both video and audio, a human annotator could correctly identify the trope. This suggests that fusing audio and visual features might be a way to tackle the trope understanding task.

External Knowledge

Certain tropes such as getting crap past the radar are represented in more obscure ways because they could violate some censorship standards. As the trope is portrayed with allusions to literature, history, or memes, viewers need external knowledge of the allusions in order to comprehend the trope. We observe that this kind of trope is where human annotators failed even with access to both video and audio. Intuitively, it would also be challenging for machines because it requires them to learn from very specific knowledge sources.

4.5. Data Availability

The TrUMAn homepage provides a brief introduction to tropes, along with the features and usage of our dataset. We also display some samples on the page to acquaint new researchers with this novel and intriguing task. The data we provide includes:

  • TrUMAn: Trope Understanding in Movies and Animations dataset has five-fold data split files, each split file includes train, validation, and test data. Each example is associated with its video ID, trope name, human-annotated description, and the detected ASR results.

  • Visual features: We provide ResNet-101 (He et al., 2015) and S3D (Xie et al., 2018) features for future researchers; their usage is introduced on our page.

  • Audio features: We also provide SoundNet (Aytar et al., 2016) features for multi-modal reasoning.

Figure 3. Proposed TrUSt model. TrUSt model consists of a multi-modal video encoder module (yellow, left, Section 5.1), our proposed novel Conceptual Storyteller module (blue, top right, Section 5.2), and a trope understanding module (red, bottom right, Section 5.3). First, the Video Encoder module takes multi-modal inputs and encodes a video embedding. Next, the Conceptual Storyteller generates a story embedding according to the video embedding to provide further information for trope understanding without the requirement of human-written descriptions during inference. Furthermore, story embedding generation guides the video encoder with additional signals. Finally, the trope understanding module predicts the trope according to input video embedding and generated story embedding.

5. Trope Understanding and Storytelling (TrUSt) Model

As a first step toward tackling the challenging TrUMAn, we propose a new Trope Understanding and Storytelling (TrUSt) network with three modules: (1) Video Encoder (Section 5.1), which encodes a multi-modal video clip into a video embedding vector. (2) Conceptual Storyteller (Section 5.2), which generates a story embedding vector according to the video embedding. The story embedding vector is optimized by minimizing its distance to the video description vector encoded by a pre-trained contextual embedding model (e.g., BERT). (3) Trope Understanding (Section 5.3), which performs trope classification based on the video embedding and story embedding vectors.

5.1. Video Encoder

The video encoder module takes N-stream inputs $S = \{s_1, \dots, s_N\}$, where each stream represents a modality such as visual or audio features, and outputs a video embedding $e_v$:

$$e_v = \mathrm{Enc}(S; \theta_{enc}),$$

where $\theta_{enc}$ are trainable parameters.

Note that the module is flexible with additional features and different encoder architectures. In this work, we use visual, audio, ASR, and object features (see Section 6.1 for details). For the video encoder, we leverage and slightly modify previous work (Liu et al., 2020; Huang et al., 2020) for video-and-language inference and video question answering. First, in each stream, the input feature $s_i$ is encoded with a per-modality (e.g., audio) encoder:

$$h_i = \mathrm{Enc}_i(s_i; \theta_i),$$

where $h_i$ is the encoded feature and $\theta_i$ stands for trainable parameters.

Afterward, we concatenate all encoded features:

$$e_v = [h_1; h_2; \dots; h_N].$$

The encoded video embedding $e_v$ is then utilized to represent the input video.

5.2. Conceptual Storyteller

We design a novel conceptual storyteller to guide the model by leveraging video descriptions. The intuition behind the module design is two-fold. First, by learning to tell a story, the video encoder could receive further signals from the video description. Second, the model-generated story is then fed into a trope understanding module to augment the video embedding without the need for human-written descriptions during inference.

Given a video embedding vector $e_v$, the module generates a story embedding vector $e_s$:

$$e_s = \mathrm{Storyteller}(e_v; \theta_s),$$

where $\theta_s$ refers to trainable parameters. Instead of actually generating the story token by token (i.e., video captioning), we generate a story embedding vector on a latent space for two reasons. (1) A vanilla encoder-decoder model requires an argmax operation to generate a token, through which the gradient cannot be back-propagated. (2) Compared to conventional video-to-text tasks and datasets, the word distribution of video descriptions is much sparser. Furthermore, we have far fewer examples available.

To optimize the generated story embedding vector, we leverage a human-written description $d$ and a frozen, pre-trained contextual encoder to obtain the contextual embedding $e_d$:

$$e_d = \mathrm{TextEnc}(d),$$

where we utilize BERT-base (Devlin et al., 2019) as the text encoder in this work.

Then, we minimize the story loss, the distance between the generated story embedding $e_s$ and the embedded description $e_d$:

$$\mathcal{L}_{story} = \mathrm{Dist}(e_s, e_d),$$

where we use cosine similarity as our distance function.
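As a small worked example of the story loss, the sketch below computes a cosine-based distance between a generated story embedding and an embedded description. The text only states that cosine similarity is used as the distance function, so the specific form "one minus cosine similarity" and the function names are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def story_loss(story_embedding, description_embedding):
    """Assumed cosine distance: 0 for aligned vectors, 2 for opposite ones."""
    return 1.0 - cosine_similarity(story_embedding, description_embedding)


# Toy 2-D vectors; real embeddings would have the text encoder's dimension.
aligned = story_loss([1.0, 0.0], [2.0, 0.0])
orthogonal = story_loss([1.0, 0.0], [0.0, 3.0])
print(aligned, orthogonal)
```

Because the distance depends only on direction, the storyteller is pushed to match the orientation of the description embedding rather than its magnitude.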

5.3. Trope Understanding

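The trope understanding module utilizes the video embedding $e_v$ and our generated story embedding $e_s$ to predict a trope. First, we generate a trope distribution $p$:

$$p = \mathrm{TropeCls}(e_v, e_s; \theta_t),$$

where $\theta_t$ are trainable parameters.

Then, we apply a cross-entropy trope loss on $p$ and the ground truth $y$:

$$\mathcal{L}_{trope} = \mathrm{CE}(p, y).$$

Finally, we take a weighted sum of the losses, where $\mathcal{L}_{story}$ is the story loss from Section 5.2 and $\lambda_{trope}$ and $\lambda_{story}$ are pre-defined hyper-parameters:

$$\mathcal{L} = \lambda_{trope}\,\mathcal{L}_{trope} + \lambda_{story}\,\mathcal{L}_{story}.$$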
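The multi-task objective can be illustrated numerically with a short sketch. The function names and the default weights of 1.0 are placeholders, since the hyper-parameter values are not given here; the story loss is passed in as an already computed number.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the ground-truth trope."""
    return -math.log(probs[target_index])


def total_loss(trope_logits, target_index, story_loss_value,
               w_trope=1.0, w_story=1.0):
    """Weighted sum of trope loss and story loss (weights are placeholders)."""
    trope_loss = cross_entropy(softmax(trope_logits), target_index)
    return w_trope * trope_loss + w_story * story_loss_value


# An untrained classifier with uniform scores over the 132 tropes.
uniform_logits = [0.0] * 132
loss = total_loss(uniform_logits, target_index=5, story_loss_value=0.5)
print(loss)
```

For uniform predictions the trope loss equals log(132), which gives a sense of the starting point that training has to improve on.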

| Method | Visual | Audio | Object | 5-fold Acc (%) | C. Trait | R. Inter | Scene Id. | Story. U | Situ. U | Sent. | Audio (cat.) | Mani. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | CNN | - | - | | 13.97 | 5.73 | 9.63 | 6.01 | 8.45 | 7.30 | 17.34 | 2.76 |
| Baseline | S3D | - | - | | 13.70 | 8.85 | 11.85 | 10.94 | 9.73 | 13.87 | 18.60 | 12.71 |
| Baseline | - | Sound | - | | 7.95 | 3.12 | 1.48 | 3.54 | 2.57 | 6.93 | 10.57 | 4.97 |
| Baseline | - | ASR | - | | 9.86 | 5.73 | 6.30 | 3.24 | 3.96 | 8.76 | 8.67 | 2.21 |
| Baseline | S3D | Sound | - | | 15.07 | 9.90 | 12.96 | 7.70 | 8.66 | 10.22 | 18.39 | 10.50 |
| Baseline | S3D | ASR | - | 12.01 | 13.97 | 10.42 | 14.44 | 8.78 | 11.12 | 14.23 | 20.93 | 11.60 |
| L-GCN (Huang et al., 2020) | - | - | frcnn | | 9.04 | 6.77 | 11.48 | 5.24 | 8.34 | 6.93 | 10.15 | 8.84 |
| XDC (Alwassel et al., 2020) | | | | 7.35 | 10.96 | 3.12 | 6.67 | 5.86 | 5.88 | 7.30 | 15.22 | 6.63 |
| TrUSt | S3D | ASR | - | | 17.26 | 19.79 | 9.63 | 10.63 | 10.91 | 17.52 | 26.00 | 7.18 |
| TrUSt | S3D | ASR | frcnn | 13.94 | 19.73 | 16.67 | 15.93 | 10.32 | 13.69 | 18.98 | 20.72 | 13.26 |
| Oracle (w/ written description) | | | | 27.84 | 40.55 | 31.77 | 31.48 | 17.41 | 26.42 | 30.29 | 32.98 | 21.55 |

Table 4. Experimental results with 5-fold cross-validation. The Visual, Audio, and Object columns list the input modalities; the last eight columns report per-category accuracy. First block: the baseline model with different modalities; the model using S3D and ASR reaches the highest score of 12.01% accuracy. Second and third blocks: two state-of-the-art methods in Video QA and action recognition, which achieve only 7 to 8% accuracy. Fourth block: our Trope Understanding and Storytelling model, TrUSt, which uses raw signals (visual, audio, and objects) for the conceptual storyteller and trope understanding and reaches the best score of 13.94%. Last block: the oracle case with access to human-written descriptions achieves 27.84% accuracy, which is remarkably better than all compared methods. (See Section 6.3)

6. Experiments

6.1. Modality


For visual signals in videos, we extract static appearance and dynamic motion features to evaluate how static and dynamic information affect trope understanding. Specifically, we apply ResNet pre-trained on ImageNet image classification to extract appearance features, and use S3D pre-trained on Kinetics action recognition to extract motion features. We denote appearance features as $V^{a} \in \mathbb{R}^{N \times 2048}$ and motion features as $V^{m} \in \mathbb{R}^{N \times 1024}$, in which N is the number of frames. In our experiments, we set N to 100 and truncate videos with more frames than that; only 0.1% of videos in TrUMAn (2 in 2423) are truncated. Both ResNet and S3D features are extracted at 0.5 fps, and the feature dimensions are 2048 and 1024, respectively.
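The interaction between the 0.5 fps sampling rate and the cap of N = 100 feature steps can be checked with simple arithmetic. The helper below is a hypothetical sketch: rounding down the number of sampled frames and keeping at least one step are assumptions, since the exact sampling convention is not stated. Under these assumptions, clips of up to 200 seconds fit without truncation.

```python
import math


def num_feature_steps(duration_sec, fps=0.5, max_steps=100):
    """Return (steps_kept, was_truncated) for a clip of the given length."""
    # Assumption: one feature per sampled frame, rounded down, at least one.
    steps = max(1, math.floor(duration_sec * fps))
    return min(steps, max_steps), steps > max_steps


# Durations taken from Table 1: the dataset average and the longest video.
average_clip = num_feature_steps(61.90)
longest_clip = num_feature_steps(237.67)
print(average_clip, longest_clip)
```

Only clips longer than 200 seconds exceed the cap, which is consistent with very few videos being truncated.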


For audio features, we use (1) SoundNet to encode all the sound that appears in the video, including music, speech, natural sound, etc. We denote the feature as $A^{s} \in \mathbb{R}^{K \times 1024}$, where K is the length of the audio signal and the feature dimension is 1024. (2) An ASR model from the Google Cloud Speech-to-Text API to extract the speech transcript of the videos, with BERT (Devlin et al., 2019) to encode the transcript. We denote the feature as $A^{t} \in \mathbb{R}^{L \times 768}$, where L is the length of the speech and the feature dimension is 768.


L-GCN (Huang et al., 2020) requires local object features as input to the graph-based model. We use Faster-RCNN (Ren et al., 2015), pre-trained on Visual-Genome, to extract object features. The object set is $O = \{(o_{n,k}, l_{n,k})\}$ for $n = 1, \dots, N$ and $k = 1, \dots, K$, where $o_{n,k}$ is the k-th object detected at the n-th frame, with dimension 2048, and $l_{n,k}$ is the spatial location of that object. Each frame has K detected objects, and the total number of frames is N.

6.2. Compared Methods

We examine trope understanding capability of modern learning systems on our TrUMAn dataset, including a modified VIOLIN (Liu et al., 2020) model as our baseline, a modern GCN based Video QA model, L-GCN (Huang et al., 2020), and a state-of-the-art self-supervised cross-modal pre-training method, XDC (Alwassel et al., 2020). To reveal gaps from machine to human, we also evaluate (1) Oracle model using human-written descriptions and (2) Human performance with sampled examples.


Following previous work (Liu et al., 2020; Lei et al., 2018), we use an LSTM network to encode visual and audio features. Since trope detection is different from Video QA and video inference problems, we remove the question (query)-video attention block of Liu et al. (2020) to form our baseline.


L-GCN (Huang et al., 2020) is a location-aware graph-based model that models the relations between objects across all video frames and has shown strong performance on Video QA tasks. We remove the QA block in the model and adapt it to our task.


XDC (Alwassel et al., 2020) is the state-of-the-art self-supervised method that leverages video and audio signals in the video for action recognition and audio classification tasks. We use the released visual model pre-trained with IG-Kinetics dataset.


In addition to the raw signals extracted from the videos, we have collected human-annotated video descriptions. These descriptions directly point out the most important parts of the videos and serve as a guide to the tropes that bypasses the videos themselves. We use BERT as a feature extractor for these descriptions, in the same way that we preprocess the ASR transcripts. The features are then passed to our baseline model so that we can directly compare the capability of raw signals and oracle annotations.

6.3. Results and Discussion


The first to fourth rows of the first block of Table 4 show the single-modal performance of the baseline model. All variants reach at best 11.60% accuracy, indicating that trope understanding is a challenging task. Also, the visual model performs better than the audio model, even in the Audio category. This echoes the human evaluation in Section 4.4, where visual signals could play a slightly more crucial role in the task. Additionally, audio signals could either be too sparse (sound) or suffer information loss (ASR). The ASR model, while ignoring music, performs better than the sound model, especially in the scene identification (third) category, where scene-related concepts might be mentioned in the dialogues. The fifth and sixth rows of the first block show the multi-modal baseline performance. Fusing visual and ASR features improves the accuracy to 12.01%, revealing that the modalities are complementary.
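The per-category comparison above relies on computing accuracy both overall and within each trope category. A minimal sketch of that bookkeeping follows, assuming each evaluation record is a (category, gold trope, predicted trope) triple. The function name and the toy records are hypothetical; they are not the paper's evaluation code or data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, str, str]  # (category, gold trope, predicted trope)


def accuracy_by_category(records: Iterable[Record]) -> Dict[str, float]:
    """Return accuracy in percent for each category plus an 'overall' entry."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, gold, predicted in records:
        # Every record counts toward its own category and toward the overall score.
        for key in (category, "overall"):
            total[key] += 1
            correct[key] += int(gold == predicted)
    return {key: 100.0 * correct[key] / total[key] for key in total}


# Toy records, invented for illustration only.
toy_records = [
    ("audio", "Title Theme Tune", "Villain Song"),
    ("audio", "Title Theme Tune", "Title Theme Tune"),
    ("scene identification", "trope_a", "trope_a"),
]
scores = accuracy_by_category(toy_records)
```

Reporting the per-category numbers next to the overall score is what makes observations such as "better even in the Audio category" checkable.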

Existing state-of-the-art method capability

The second block of Table 4 shows the performance of L-GCN (Huang et al., 2020), a state-of-the-art Video QA model using detected object features. Its accuracy falls between those of the audio and visual models, revealing that the composition of detected objects does not represent tropes well. The third block shows XDC (Alwassel et al., 2020), a self-supervised audio-video clustering approach that reaches state-of-the-art results on action recognition tasks. The model performs significantly better on the audio category despite taking only visual signals as input, which shows that cross-modal audio-visual pre-training could help the model capture audio-related deep semantics in the video, and also echoes the conceivable tropes we mentioned in the human evaluation. However, it does not perform well overall (7.35% accuracy), revealing that transferring visual semantics to trope understanding in general remains challenging.


The fourth block of Table 4 shows the results of our Trope Understanding and Storytelling model, TrUSt. As shown in the first row, our Conceptual Storyteller component boosts the baseline model to 12.88% accuracy. The performance of most categories rises, especially character trait, role interaction, and audio. This is reasonable, as character trait and role interaction require understanding the characteristics of the roles or the intentions behind their actions, and storytelling helps the model apprehend them. Audio performance also improves because the story embedding complements the raw signals. Specifically, without the Conceptual Storyteller, the model might recognize that the video is audio-related but fail to identify the exact trope, e.g., mistaking a “Title Theme Tune” for a “Villain Song”. The second row shows the performance of TrUSt with an additional frcnn stream, which further boosts the performance to 13.94% accuracy, the best among all non-oracle models. TrUSt with only S3D+ASR performs worse in scene identification, which might stem from the story embedding being generated from sparse raw signals instead of specific objects. With the aid of the frcnn stream, the story embedding can be generated from both sparse visual cues (the whole frame) and dense object cues, which raises the scene identification performance to 15.93%. This is also better than the multi-modal baseline or the L-GCN model alone and indicates the effectiveness of the proposed Conceptual Storyteller module.

Oracle model w/ human-written description

It is worth noting that the oracle model (sixth block), which accesses the human-written descriptions, reaches 27.84% accuracy. While it is still far from human performance, the score is more than double that of the best baseline model. Therefore, we argue that trope understanding in movies and animations is much more difficult than the previously proposed trope understanding on synopses (Chang et al., 2021), and that it can examine the machine capability of processing raw signals, which is crucial for real-world applications such as movie recommendation systems.
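The comparisons in this section can be checked with a few lines of arithmetic. The sketch below uses only accuracies quoted in the text (12.01%, 12.88%, 13.94%, 27.84%); the dictionary keys are hypothetical labels chosen for readability.

```python
# Accuracies (in percent) quoted in the text; the keys are illustrative labels.
REPORTED = {
    "baseline_multimodal": 12.01,
    "trust_s3d_asr": 12.88,
    "trust_with_frcnn": 13.94,
    "oracle": 27.84,
}


def gain(model: str, reference: str) -> float:
    """Absolute accuracy gain of `model` over `reference`, in percentage points."""
    return round(REPORTED[model] - REPORTED[reference], 2)


def ratio(model: str, reference: str) -> float:
    """How many times the accuracy of `reference` the `model` reaches."""
    return REPORTED[model] / REPORTED[reference]


storyteller_gain = gain("trust_s3d_asr", "baseline_multimodal")
full_gain = gain("trust_with_frcnn", "baseline_multimodal")
oracle_vs_baseline = ratio("oracle", "baseline_multimodal")
oracle_vs_trust = ratio("oracle", "trust_with_frcnn")
```

With these numbers the oracle reaches roughly 2.3 times the accuracy of the multi-modal baseline, and just under twice that of TrUSt with the frcnn stream, so "more than double" holds relative to the baseline specifically.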

Columns: Modality | Baseline | TrUSt (w/o ST.Emb) | TrUSt
Table 5. Ablation study on TrUSt. In the modality column, V denotes video features, A denotes ASR features, and O denotes object features. The three model types are the baseline model, TrUSt without feeding the story embedding back to the trope understanding module (multitask structure only), and our full TrUSt model. The results show that multitasking alone can guide the video encoder toward a better video embedding when given adequate input signals (V+A+O). The full TrUSt model further improves accuracy by about 0.9 to 1.5% for both variants.

Ablation study on TrUSt

Table 5 presents the ablation study on TrUSt to indicate the effectiveness of each proposed module. We compare (1) TrUSt, (2) TrUSt without feeding the story embedding back to the trope understanding module (multitask structure only), and (3) the baseline model. The first row of Table 5 displays the results with visual and audio signals, and the second row shows the results with visual, audio, and object signals. The second column shows a result comparable to the baseline model without object signals (first row) and a 1.0% performance gain with object signals (second row). This demonstrates that multitasking alone guides the video encoder when given adequate input signals. Also, as shown in the third column, the full TrUSt further boosts accuracy by leveraging the story embedding.
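The three ablation variants differ in two switches: whether a storytelling objective is trained alongside trope classification, and whether the story embedding is fed back to the classifier. The sketch below illustrates that distinction under loud assumptions: concatenation as the feedback mechanism, a weighted sum as the multitask loss, and all names (`VARIANTS`, `classifier_input`, `total_loss`, `story_weight`) are hypothetical, since the section does not specify these details.

```python
from typing import Dict, List, Optional

Vector = List[float]

# Two switches per variant; the keys are illustrative labels for the Table 5 columns.
VARIANTS: Dict[str, Dict[str, bool]] = {
    "baseline": {"storytelling_loss": False, "feed_story": False},
    "trust_wo_st_emb": {"storytelling_loss": True, "feed_story": False},
    "trust": {"storytelling_loss": True, "feed_story": True},
}


def classifier_input(video_emb: Vector, story_emb: Optional[Vector], feed_story: bool) -> Vector:
    """Build the trope classifier input (assumption: feedback is concatenation)."""
    if not feed_story:
        return list(video_emb)
    if story_emb is None:
        raise ValueError("story embedding required when feed_story is True")
    return list(video_emb) + list(story_emb)


def total_loss(trope_loss: float, story_loss: float, storytelling_loss: bool,
               story_weight: float = 1.0) -> float:
    """Combine losses (assumption: weighted sum when multitasking is enabled)."""
    if storytelling_loss:
        return trope_loss + story_weight * story_loss
    return trope_loss


video, story = [0.1, 0.2], [0.3]
inputs = {name: classifier_input(video, story, cfg["feed_story"]) for name, cfg in VARIANTS.items()}
losses = {name: total_loss(1.0, 0.5, cfg["storytelling_loss"]) for name, cfg in VARIANTS.items()}
```

In this reading, the multitask-only variant changes training (the loss) but not the classifier input, while the full model changes both, which is the contrast the second and third columns of Table 5 are meant to isolate.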

Figure 4. Qualitative results. The first two cases show the effectiveness of TrUSt. Our TrUSt benefits from the story embedding and understands the implications of the video without requiring human-written descriptions during inference, whereas the other models grasp only the shallow meaning of these videos, which leads to wrong answers. The third case is a failure case of the oracle model and TrUSt. The trope concerns the weapon’s special effects, which are conveyed by the raw signals but not by the descriptions. Despite generally performing better, our model fails in this case. (See Section 6.3 for details.)

Qualitative Analysis

Figure 4 shows several cases predicted by the oracle model, TrUSt, the baseline model, L-GCN, and XDC. In the first and second cases, TrUSt and the oracle model predict the correct answer; the third case is a failure case of TrUSt. In the first case, the oracle model understands the trope based on the human-written description, indicating that descriptions can give the model hints and simplify trope understanding. Our TrUSt also comprehends the video with the aid of the Conceptual Storyteller module. The baseline and XDC partly capture the sad mood portrayed in the film, but they mistake Emma reminiscing about the old days for someone sacrificing heroically. The second case is an expository theme of the video. The oracle model and TrUSt perform well in this case. These two cases demonstrate the effectiveness of our proposed Trope Understanding and Storytelling model, which leverages the story embedding without needing human-written descriptions during inference. Meanwhile, the baseline model, L-GCN, and XDC, despite capturing the musicality of the video, fail to apprehend the trope. In the last case, only the baseline method gets the correct answer. To predict Shock and Awe, it is crucial to capture the cues of light or the special effects of the weapon provided by the raw visual and audio signals. The oracle model and TrUSt might over-conceive the descriptions (story embedding) and output a wrong answer. Given this issue, future work might investigate the boundary between storytelling and over-conceiving to further improve performance.

7. Conclusion

In this work, we propose a novel task, along with a new dataset, TrUMAn. Different from existing datasets, TrUMAn requires deep cognition skills to comprehend causality and motivation beyond visual semantics, and cannot be solved with language bias and implications. By developing machines with trope understanding capability, various data mining applications and algorithms could be taken to the next level. To tackle this challenging task, we present a new model, TrUSt, which jointly performs storytelling and trope understanding with a novel module, the Conceptual Storyteller. Experimental results show that learning systems that perform well on conventional video comprehension benchmarks reach at most 12.01% accuracy, revealing the room for improvement for modern learning systems. Our proposed Conceptual Storyteller boosts the model performance by 2% accuracy and reaches the state-of-the-art performance of 14% accuracy. Additionally, the oracle case, which uses human-written descriptions instead of raw signals, boosts machine performance to 28% accuracy, indicating that our dataset is more challenging than the trope understanding dataset with movie synopses. Therefore, we believe that our proposed task and dataset could lead to a new path for future research and applications.


This work was supported in part by the Ministry of Science and Technology, Taiwan, under Grant MOST 110-2634-F-002-026. We benefited from the NVIDIA DGX-1 AI Supercomputer and are grateful to the National Center for High-performance Computing.


  • H. Alwassel, D. Mahajan, B. Korbar, L. Torresani, B. Ghanem, and D. Tran (2020) Self-supervised learning by cross-modal audio-video clustering. In NeurIPS, Cited by: §1, Table 4, §6.2, §6.2, §6.3.
  • Y. Aytar, C. Vondrick, and A. Torralba (2016) SoundNet: learning sound representations from unlabeled video. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pp. 892–900. Cited by: 3rd item.
  • B. Jasani, R. Girdhar, and D. Ramanan (2019) Are we asking the right questions in MovieQA?. In ICCV Workshops, Cited by: §1, §3.
  • Y. Bengio (2019) From system 1 deep learning to system 2 deep learning. NeurIPS. Cited by: §1.
  • C. Chang, H. Su, J. Hsu, Y. Wang, Y. Chang, Z. Y. Liu, Y. Chang, W. Cheng, K. Wang, and W. H. Hsu (2021) Situation and behavior understanding by trope detection on films. In WWW, Cited by: §1, §1, §1, §3, §6.3.
  • D. Chen and W. Dolan (2011) Collecting highly parallel data for paraphrase evaluation. In ACL, Cited by: §2.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. Cited by: §5.2, §6.1.
  • K. He, X. Zhang, S. Ren, and J. Sun (2015) Deep residual learning for image recognition. Cited by: 2nd item.
  • M. Heilman and N. A. Smith (2010) Good question! statistical ranking for question generation. In HLT-NAACL, Cited by: §3.
  • D. Huang, P. Chen, R. Zeng, Q. Du, M. Tan, and C. Gan (2020) Location-aware graph convolutional networks for video question answering. In AAAI, Cited by: §1, §5.1, Table 4, §6.1, §6.2, §6.2, §6.3.
  • K. Kim, M. Heo, S. Choi, and B. Zhang (2017) DeepStory: video story QA by deep embedded memory networks. In IJCAI, Cited by: §1.
  • J. Lei, L. Yu, M. Bansal, and T. Berg (2018) TVQA: localized, compositional video question answering. In EMNLP, Cited by: §1, §3, §6.2.
  • J. Liu, W. Chen, Y. Cheng, Z. Gan, L. Yu, Y. Yang, and J. Liu (2020) Violin: a large-scale dataset for video-and-language inference. In CVPR, Cited by: §1, §3, §5.1, §6.2, §6.2.
  • S. Ren, K. He, R. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, pp. 91–99. Cited by: §6.1.
  • J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota (2017) Harnessing a.i. for augmenting creativity: application to movie trailer creation. In ACM MM, Cited by: §3.
  • M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler (2016) MovieQA: understanding stories in movies through question-answering. In CVPR, pp. 4631–4640. Cited by: §1, §3.
  • T. Winterbottom, S. Xiao, A. McLean, and N. A. Moubayed (2020) On modality bias in the TVQA dataset. In BMVC, Cited by: §1, §3.
  • S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy (2018) Rethinking spatiotemporal feature learning: speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: 2nd item.
  • D. Xu, Z. Zhao, J. Xiao, F. Wu, H. Zhang, X. He, and Y. Zhuang (2017) Video question answering via gradually refined attention over appearance and motion. In ACM Multimedia, Cited by: §1, §3.
  • J. Xu, T. Mei, T. Yao, and Y. Rui (2016) MSR-VTT: a large video description dataset for bridging video and language. In CVPR, Cited by: §2.
  • J. Yang, Y. Zhu, Y. Wang, R. Yi, A. Zadeh, and L. Morency (2020) What gives the answer away? question answering bias analysis on video QA datasets. In Human Multimodal Language Workshop, Cited by: §1, §3.
  • Z. Yu, D. Xu, J. Yu, T. Yu, Z. Zhao, Y. Zhuang, and D. Tao (2019) ActivityNet-QA: a dataset for understanding complex web videos via question answering. In AAAI, Cited by: §1, §3.
  • K. Zeng, T. Chen, C. Chuang, Y. Liao, J. C. Niebles, and M. Sun (2017) Leveraging video descriptions to learn video question answering. In AAAI, Cited by: §1, §3.