
Effectively leveraging Multi-modal Features for Movie Genre Classification

Movie genre classification has been widely studied in recent years due to its various applications in video editing, summarization, and recommendation. Prior work has typically addressed this task by predicting genres based solely on the visual content. As a result, these methods often perform poorly on genres such as documentary or musical, since non-visual modalities like audio or language play an important role in correctly classifying these genres. In addition, analyzing long videos at the frame level incurs a high computational cost and makes prediction less efficient. To address these two issues, we propose a Multi-Modal approach leveraging shot information, MMShot, to classify video genres in an efficient and effective way. We evaluate our method on MovieNet and Condensed Movies for genre classification, improving mean Average Precision (mAP) by 17% to 21% over the state-of-the-art. Extensive experiments are conducted to demonstrate the ability of MMShot for long video analysis and uncover the correlations between genres and multiple movie elements. We also demonstrate our approach's ability to generalize by evaluating it on the scene boundary detection task, improving Average Precision (AP) by 1.1% over the previous state-of-the-art.



1 Introduction

Movie genre plays an important role in video analysis by reflecting the narrative elements, aesthetic approaches, and emotional responses. Developing a reliable video genre classification method (see Figure 1 for an illustration) enables a wide range of applications such as organizing similar user videos from social media sites, correcting mislabeled videos, highlighting key frames from long videos, and retrieving a particular type of film for recommendation systems, among others [17, 16]. Motivated by these applications, researchers have applied different frameworks [51, 35, 17, 7, 16] for genre classification.

Figure 1: Given a video such as a trailer or movie clip, genre prediction is a multi-label classification problem. Columns 1-3: genres such as animation, romance, and Sci-Fi can be classified well based on the visual modality, as shown in prior work. However, as we will show, the genres in columns 4-5, such as musical and biography, rely on high-level semantics that are discarded by prior work (e.g., [16]). Thus, in our approach, which we refer to as MMShot, we base our predictions on both visual and non-visual modalities like audio and language, achieving significantly improved performance over prior work.

Early work [29, 5, 51, 35] on genre classification focused on several specific categories and used small-scale datasets. Limited by the scale of these datasets, such methods could only perform image-based (posters or still frames) genre classification rather than video-based genre classification. In recent years, large-scale datasets for video genre prediction such as MovieNet [16] and Condensed Movies [1] have been introduced. Based on these large-scale benchmarks, deep video encoders for action recognition [45, 6] or temporal relational reasoning [49] were applied to video-based genre classification. However, these methods process videos at the frame level, which results in high computational cost and thus makes it extremely inefficient to process long movies. To classify videos in an efficient way, [17] proposed a shot-based video encoding approach that divides videos into separate shots and uses shot representations to predict genres. Although this method can efficiently learn from videos, it only pays attention to the visual modality while ignoring other important modalities such as audio or language.

Figure 2: The pipeline of MMShot. Our approach mainly consists of 3 steps: (1) extracting non-visual modalities from input data. Specifically, the audio modality is associated with the given videos, and we leverage an ASR system to get the caption corresponding to each video; (2) encoding the multiple modalities with different encoders. A keyword extraction method is employed to alleviate the influence of noise introduced by the ASR system; (3) applying fusion strategies on feature representations of different modalities. See Section 3 for specific details.

In this paper, we mainly investigate two questions: 1) How to effectively leverage multi-modal features to classify genres of a given video? 2) How to analyze genres of long videos? We propose a Multi-Modality approach leveraging Shot information (MMShot) to effectively and efficiently predict video genres. Specifically, we note that high-level semantics such as storylines and background could be implicitly pointed out by narrators or background music. Thus, MMShot leverages both the audio and spoken words in addition to visual modality to further improve the performance. This is in contrast to prior work, e.g., [17, 16], that discards this information by relying only on the visual clues in the video. To enable long video analysis, MMShot combines a sliding window approach with a shot-based mechanism that first divides a video into separate shot components and then averages the shot features for prediction. Inspired by current video processing methods using sparse sampling [49, 21], we subsample frames from each video segment to create the shot representation.

Figure 2 presents the overall pipeline of MMShot. Specifically, we begin by extracting the audio modality, which naturally accompanies the input video. Then the language modality is obtained by an Automatic Speech Recognition (ASR) system [34]. Our experimental results show that the raw captions recognized by our ASR system contain noise that can even reduce the performance of our genre classification system. To address this problem, we propose an approach for extracting keywords to obtain a less noisy language representation, boosting performance.

Our contributions are summarized as:

  • We propose a multi-modal framework (MMShot) to introduce audio and language information to the video genre classification task. The incorporation of higher-level semantics and background music helps classify genres where visual-based models could fail. In contrast to prior work [7], MMShot extracts the audio and language information purely based on the input videos and does not rely on extra sources like Wikipedia, metadata, or movie posters.

  • We introduce a keyword extraction algorithm to the language modality, alleviating the issue caused by noisy captions recognized from audio.

  • Genre prediction results on MovieNet [16] and Condensed Movies [1] demonstrate that MMShot notably outperforms the state-of-the-art, improving mAP by roughly 17% to 21% while using only 41% of the training data.

  • We transfer our model to the scene boundary detection task, achieving a new state-of-the-art and demonstrating the generalization of MMShot.

  • Extensive experiments are performed to demonstrate the ability of MMShot to analyze long videos and uncover the correlations between movie elements and genres.

2 Related Work

Studies on Movies span a great number of research topics including genre classification [17, 16, 7], scene boundary detection [27, 10], shot boundary detection [37, 36], person re-identification [47], action recognition [4, 33, 48], alignment between movie and text descriptions [52, 39, 11], understanding relationships of film characters [23, 46, 2, 20], movie question answering [40, 44, 18], scene and event understanding [10, 32], and many others. Many existing works understand movies from a visual perspective or align the visual modality with corresponding labels in other modalities such as actions and text descriptions, among others. In this paper, we investigate the effect that the audio and language modalities have on genre classification. A related method, Moviescope [7], relies on additional information including posters, Wikipedia, and metadata. In contrast, MMShot extracts audio and language information based solely on the input videos, without requiring extra overhead, and leverages the additional modalities for free.

Movie Genre Classification can be divided into two major categories: image-based (posters, still frames, etc.) [51, 35, 16] or video-based (trailers, movie clips, etc.) [17, 16]. Recently, researchers have transferred popular video recognition frameworks to movie genre classification, such as methods for action recognition [42, 45, 49, 22, 9] and video summarization [24, 43]. An obstacle for these frameworks is the computational cost. Methods [42, 45] that take all frames as input cannot feasibly handle videos that last for hours. Though sparse sampling strategies [49, 21] have been proposed to process videos more efficiently, the analysis of hour-long videos would still cost significant resources. To address this issue, we adopt a shot-based approach [16] to first divide long videos into separate shots. Then we introduce a sliding window mechanism on shot representations to process long videos efficiently.

Scene Boundary Detection tries to localize the beginning and end of different scenes in videos. Early methods [28, 31] use unsupervised learning to detect scenes based on the color similarity of shots. Because several human-annotated datasets [30, 3] were proposed recently, many supervised learning approaches [3, 25, 27, 30] have followed. A major step in this direction was taken by the recently proposed dataset MovieNet [16], in which 1,100 movies are released and 318 of them are annotated with scene boundaries.

3 MMShot: Multi-modal Approach leveraging Shot Information

Video genre classification is a multi-label classification task aiming to predict genres from input videos. As illustrated in Figure 2, the standard input is a video such as a trailer or a movie clip, and the output is one or more corresponding genres. Given an input video, MMShot first extracts the audio information associated with the video and then applies an Automatic Speech Recognition (ASR) system [34] and a keyword extraction algorithm to obtain representative text. A fusion module and a classifier are then applied to the multi-modal representation to predict genres. In this section, we first introduce the encoders used by MMShot in Section 3.1. Our approach to effectively leverage language information is then introduced in Section 3.2. After that, we present fusion strategies and details of MMShot in Section 3.3.

3.1 Multi-modal Encoders

Visual Modality. Since long videos may need to be analyzed, video encoders that take all frames as input [42, 6] would be computationally expensive and inefficient. Therefore, we adopt a shot-based mechanism, similar to MovieNet [16], to extract the visual representation. Specifically, we first divide the input video into separate shots and consider each shot as the basic unit of input. For each shot $s_i$, we uniformly sample $K$ frames $x_{i,1}, \dots, x_{i,K}$ and use the average feature of the frames as the representation of the shot:

$$f_{s_i} = \frac{1}{K} \sum_{k=1}^{K} \phi(x_{i,k}; \theta) \tag{1}$$

where $\phi$ represents the feature extractor and $\theta$ is the corresponding parameter. From the shot representations $f_{s_i}$, we further get the video representation by averaging over the $N$ shots:

$$f_v = \frac{1}{N} \sum_{i=1}^{N} f_{s_i} \tag{2}$$

Motivated by the powerful ability of large vision-language models such as Contrastive Language-Image Pretraining (CLIP) [26] on video-text retrieval tasks, we adopt the image encoder of CLIP as our backbone, resulting in a video representation of dimension 512.
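The shot-level and video-level averaging described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `encode_frame` is a hypothetical stand-in for the CLIP image encoder, `sample_uniform` is one possible reading of "uniformly sample", and feature vectors are plain lists of floats.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_vectors(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def sample_uniform(items: Sequence, k: int) -> list:
    """Pick k items at evenly spaced positions (all items if there are at most k)."""
    if len(items) <= k:
        return list(items)
    step = len(items) / k
    # Take the item at the centre of each of the k equal-width segments.
    return [items[int(i * step + step / 2)] for i in range(k)]


def video_representation(
    shots: Sequence[Sequence],
    encode_frame: Callable[[object], Vector],  # hypothetical frame encoder
    frames_per_shot: int = 3,
) -> Vector:
    """Average frame features within each shot, then average the shot features."""
    shot_features = []
    for frames in shots:
        sampled = sample_uniform(frames, frames_per_shot)
        shot_features.append(mean_vectors([encode_frame(f) for f in sampled]))
    return mean_vectors(shot_features)
```

With a real encoder, `shots` would hold the decoded frames of each shot; the cost then scales with the number of shots times `frames_per_shot` rather than with the total number of frames.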

Audio Modality. We apply a large pretrained audio pattern recognition network, PANNs [19], to extract the information for the audio modality. PANNs is a Wavegram-Logmel CNN-based model trained on AudioSet [14]. To get the audio representation of the input video, we first resample the audio files to 16 kHz so that the sampling rate is consistent with our ASR system, Silero [34]. The resampled audio waveforms are then provided to PANNs as input. We remove the classifier head of PANNs to get 2048-dimensional audio embeddings $f_a$.

3.2 Language Representation

Speech-to-Text Recognition. While some videos come with captions, a considerable number of movie clips and trailers do not have captions. To circumvent the dependence on provided captions, we incorporate an ASR model, Silero [34], to generate captions in MMShot. In other words, the input of MMShot only spans the visual and audio modalities; the language modality is extracted from the audio modality, so our model leverages it for free.

Keywords Extraction. A straightforward method to incorporate the language information is to directly apply a language encoder to the captions extracted from audio waveforms. However, as shown in our experiments in Section 4.2, we find that directly applying a language encoder like BERT [13] on raw captions might even hurt the performance of our multi-modal method. We attribute it to the fact that the ASR system cannot recognize all language tokens perfectly. Therefore, the extracted captions contain a lot of noise that might affect the genre prediction results. Thus we apply a keyword extraction algorithm to solve this issue.

Motivated by the intuition that nouns, pronouns, and adjectives usually contain important clues to describe events in videos, we identify each word’s part of speech using SpaCy [15] and keep the tokens in these categories. Given these tokens, we select the top $k$ tokens that appear most frequently in the captions (in our experiments, we set $k$ to 20). The top $k$ tokens are considered as keywords, and we apply the text encoder of CLIP to extract a 512-dimensional feature vector as our language representation $f_l$ for each sample.
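The keyword selection step can be illustrated with a short sketch. It assumes the caption has already been tagged, so each token arrives as a `(word, pos)` pair; the tag names `NOUN`, `PRON`, and `ADJ` follow the Universal POS scheme that SpaCy exposes, while the function name, lowercasing, and tie-breaking are illustrative choices rather than details from the paper.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Parts of speech kept as keyword candidates (Universal POS tag names).
KEPT_POS = {"NOUN", "PRON", "ADJ"}


def extract_keywords(tagged_tokens: Iterable[Tuple[str, str]], top_k: int = 20) -> List[str]:
    """Return the top_k most frequent tokens whose POS tag is in KEPT_POS.

    Lowercasing is an assumption of this sketch; ties keep first-seen order.
    """
    counts = Counter(
        word.lower() for word, pos in tagged_tokens if pos in KEPT_POS
    )
    return [word for word, _ in counts.most_common(top_k)]


# Toy caption that has already been POS-tagged.
caption = [
    ("The", "DET"), ("soldier", "NOUN"), ("fights", "VERB"),
    ("brave", "ADJ"), ("soldier", "NOUN"), ("he", "PRON"),
    ("Soldier", "NOUN"), ("brave", "ADJ"),
]
keywords = extract_keywords(caption, top_k=2)
```

The resulting keyword list would then be joined into a short text and passed to the text encoder in place of the full noisy caption.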

3.3 Fusion Strategy

The fusion strategy plays a crucial role in effectively combining multi-modal features. In this section, we compare three fusion methods to determine which strategy best fits MMShot.

Early Fusion directly concatenates the feature embeddings $f_v$, $f_a$, and $f_l$ from the visual, audio, and language modalities. A fully-connected layer is applied after the concatenation to get the final prediction score $s$:

$$s = \sigma(W [f_v; f_a; f_l] + b) \tag{3}$$

where $W$ and $b$ are learnable parameters, $[\cdot\,;\cdot]$ denotes concatenation, and $\sigma$ denotes the activation function of a layer; we use ReLU for hidden layers. Since movie genre prediction is a multi-label classification problem, we apply the sigmoid function to map the output node of each genre to a probability:

$$p = \mathrm{sigmoid}(s) \tag{4}$$

Intermediate Fusion merges the intermediate features obtained by separate models trained on different modalities. Taking the visual modality as an example, we have

$$h_v = \sigma(W_v f_v + b_v) \tag{5}$$

where $W_v$ and $b_v$ are learnable parameters for the visual modality, and the visual-only model is optimized using the same loss as the prediction in Eq. 4. Given the intermediate features, the final prediction is calculated by concatenating $h_v$, $h_a$, and $h_l$:

$$s = \sigma(W [h_v; h_a; h_l] + b) \tag{6}$$

Late Fusion does not fuse the modalities until the last layer of the model. The prediction score $s$ can be considered as the average of the per-modality scores $s_v$, $s_a$, and $s_l$.
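To make the difference between fusing early and fusing late concrete, the sketch below contrasts the two extremes on toy vectors. It is a simplification, not the trained model: a single linear layer stands in for the classifier, and the names `linear`, `early_fusion`, and `late_fusion` are illustrative.

```python
import math
from typing import List, Sequence

Vector = List[float]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def linear(weights: Sequence[Vector], bias: Vector, x: Vector) -> Vector:
    """One fully-connected layer: weights has one row per output node."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def early_fusion(features: Sequence[Vector], weights: Sequence[Vector], bias: Vector) -> Vector:
    """Concatenate modality features first, then classify the joint vector."""
    fused = [value for feature in features for value in feature]
    return [sigmoid(score) for score in linear(weights, bias, fused)]


def late_fusion(per_modality_scores: Sequence[Vector]) -> Vector:
    """Average the per-genre scores predicted independently by each modality."""
    n = len(per_modality_scores)
    return [sum(column) / n for column in zip(*per_modality_scores)]
```

Intermediate fusion sits between the two: each modality is first passed through its own trained layer, and the resulting hidden vectors, rather than the raw features or the final scores, are concatenated.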

Loss Function. Considering video genre prediction is a multi-label classification task, we apply binary relevance strategy to train our model. The prediction head of MMShot is an ensemble of single-label binary classifiers where each classifier predicts whether the video contains a specific genre. The union of these predicted genres is taken as the final output. As a result, we use Binary Cross Entropy loss to train each classifier and average these losses to train MMShot.
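A minimal sketch of the binary relevance objective follows: one Binary Cross Entropy term per genre, averaged over genres, with the predicted genre set taken as every genre whose probability reaches a threshold. The clipping constant `eps` and the use of `>=` at the threshold are assumptions made for this illustration.

```python
import math
from typing import List, Sequence


def bce_loss(probs: Sequence[float], targets: Sequence[int], eps: float = 1e-7) -> float:
    """Mean Binary Cross Entropy over independent per-genre classifiers."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def predict_genres(probs: Sequence[float], genres: Sequence[str], threshold: float = 0.5) -> List[str]:
    """Union of all genres whose binary classifier fires."""
    return [genre for genre, p in zip(genres, probs) if p >= threshold]
```

For example, `predict_genres([0.9, 0.2, 0.5], ["Action", "Drama", "Sci-Fi"])` keeps Action and Sci-Fi, since each genre is decided independently of the others.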

4 Experiments

4.1 Datasets and Experiment Settings

Datasets. We evaluate MMShot on MovieNet [16] and Condensed Movies [1]. The released version of MovieNet contains 1.1K movies and 30K trailers. Using the provided trailer URLs, we downloaded the source videos from YouTube. After filtering out invalid links and unlabeled trailers, we obtained 28,466 trailers in total. Following [16], we randomly split the 28K trailers into training, validation, and test sets with a ratio of 7:1:2. Condensed Movies consists of 33K movie clips from 3,600 movies. After processing Condensed Movies using the same procedure as MovieNet, we obtained 22,174 movie clips. We split the dataset into 15,521 training clips, 2,217 validation clips, and 4,436 testing clips.
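As a quick arithmetic check, the reported Condensed Movies split sizes are consistent with a 7:1:2 split of 22,174 clips when the training and validation counts are rounded down and the remainder goes to the test set. The rounding rule is an assumption of this sketch; the text states only the ratio and the resulting counts.

```python
from typing import Tuple


def split_sizes(total: int, ratios: Tuple[int, int, int] = (7, 1, 2)) -> Tuple[int, int, int]:
    """Train/validation/test sizes for a ratio split, flooring the first two parts."""
    denom = sum(ratios)
    train = total * ratios[0] // denom
    val = total * ratios[1] // denom
    test = total - train - val  # remainder, so the three parts always sum to total
    return train, val, test


condensed_movies_split = split_sizes(22_174)
```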

Metrics. Following [16], we adopt recall@0.5, precision@0.5, and mean average precision (mAP) as our evaluation metrics. Here 0.5 is the threshold that separates positive predictions from negative predictions. Since the distribution of movie genres is extremely unbalanced, we report the scores at both the “macro” level and the “micro” level. A “macro” average calculates the metrics for each genre and then averages them. It weighs each class equally and hence does not take label imbalance into account. In contrast, a “micro” average calculates the metrics globally. It aggregates the contributions of all classes and therefore reflects label imbalance. In other words, “macro” amplifies the impact of samples belonging to small categories while “micro” considers each sample equally.
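The difference between the two averaging modes is easiest to see on a toy example with one frequent and one rare genre. The sketch below computes precision and recall at a 0.5 threshold both ways; returning 0.0 when a denominator is empty is a convention chosen for this illustration, and mAP is omitted for brevity.

```python
from typing import Dict, Sequence, Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0  # convention: 0.0 if undefined
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def macro_micro(
    y_true: Sequence[Sequence[int]],
    y_prob: Sequence[Sequence[float]],
    threshold: float = 0.5,
) -> Dict[str, Tuple[float, float]]:
    """Return (precision, recall) under macro and micro averaging."""
    n_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(n_labels)]  # tp, fp, fn per genre
    for truth, probs in zip(y_true, y_prob):
        for j in range(n_labels):
            predicted = probs[j] >= threshold
            if predicted and truth[j]:
                counts[j][0] += 1
            elif predicted:
                counts[j][1] += 1
            elif truth[j]:
                counts[j][2] += 1
    per_label = [precision_recall(*c) for c in counts]
    macro = (
        sum(p for p, _ in per_label) / n_labels,
        sum(r for _, r in per_label) / n_labels,
    )
    totals = [sum(c[i] for c in counts) for i in range(3)]
    micro = precision_recall(*totals)
    return {"macro": macro, "micro": micro}


# Genre 0 is frequent and always predicted; genre 1 is rare and always missed.
labels = [[1, 0], [1, 0], [1, 0], [0, 1]]
scores = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.1], [0.6, 0.3]]
result = macro_micro(labels, scores)
```

Here the missed rare genre pulls the macro scores down to (0.375, 0.5), while the micro scores stay at (0.75, 0.75) because the frequent genre dominates the pooled counts.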

Implementation Details. Each input video is split into separate shots by TransNet v2 [36]. We randomly select 8 shots where each shot consists of 3 sampled frames as the visual representation of the input video. We sample audio waveforms at a rate of 16 kHz from each video as the input to both PANNs [19] and Silero [34]. We adopt a BERT encoder [13] to encode raw captions and a CLIP encoder [26] to encode key words. Our models are trained with a batch size of 256 and a maximum learning rate of on an RTX-3090. See the supplementary for more details.

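| Dataset | Model | macro r@0.5 | macro p@0.5 | macro mAP | micro r@0.5 | micro p@0.5 | micro mAP |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MovieNet | TSN [45] | 17.95 | 78.31 | 43.70 | - | - | - |
| MovieNet | I3D [6] | 16.54 | 69.58 | 35.79 | - | - | - |
| MovieNet | TRN [49] | 21.74 | 77.63 | 45.23 | - | - | - |
| MovieNet | MovieNet [16] | 19.52 | 72.40 | 44.02 | 33.32 | 64.55 | 53.14 |
| MovieNet | MMShot-V | 40.38 | 70.74 | 58.82 | 50.33 | 73.80 | 69.63 |
| MovieNet | MMShot-VA | 42.06 | 74.01 | 61.57 | 54.26 | 75.93 | 72.67 |
| MovieNet | MMShot-VAL | 42.21 | 71.67 | 60.08 | 52.46 | 74.68 | 70.87 |
| MovieNet | MMShot-VAL (keywords) | 42.61 | 74.69 | 62.26 | 55.30 | 75.27 | 72.95 |
| Condensed Movies | MovieNet [16] | 14.87 | 61.57 | 41.33 | 26.39 | 68.95 | 54.83 |
| Condensed Movies | MMShot-V | 35.92 | 71.28 | 57.64 | 46.82 | 73.83 | 68.89 |
| Condensed Movies | MMShot-VA | 46.37 | 70.99 | 62.65 | 57.21 | 71.67 | 71.75 |
| Condensed Movies | MMShot-VAL | 39.55 | 65.09 | 54.75 | 51.09 | 63.96 | 61.65 |
| Condensed Movies | MMShot-VAL (keywords) | 46.40 | 71.96 | 62.56 | 57.08 | 72.36 | 71.84 |

Table 1: Quantitative results of genre classification on MovieNet and Condensed Movies. V, A, and L denote the visual, audio, and language modalities respectively. "keywords" denotes applying our keyword extraction algorithm on the language modality.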

4.2 Genre Classification

We validate the effectiveness of MMShot versus the current state-of-the-art MovieNet [16] and the benchmarks introduced by MovieNet including TSN [45], I3D [6], and TRN [49].

Quantitative Results. Table 1 reports the quantitative scores of MMShot versus baselines on both datasets. (Scores of TSN, I3D, and TRN are cited from [16]. We reproduce the architecture of [16] to apply it on both the MovieNet and Condensed Movies datasets. Note that the scores reported in [16] come from models trained on 68K trailers, whereas we trained our models on the released 28K trailers, approximately 40% of the training data used in MovieNet.) We observe that our approach, MMShot, significantly outperforms the baselines on both datasets, improving macro-mAP by 17% to 21% and micro-mAP by 17% to 19%. This demonstrates that MMShot boosts performance not only over all samples but also on samples of imbalanced genres. Though our models do not achieve the best performance across all metrics (e.g., TSN obtains the highest p@0.5 but a poor r@0.5), MMShot achieves the best performance on the comprehensive metric, mAP, illustrating that it offers a better trade-off among the various metrics. We draw three conclusions from the table: (1) The knowledge of the CLIP model, which was proposed for image-text retrieval, also improves performance on the video classification task (MMShot-V vs. baselines); (2) Effectively leveraging multi-modal features improves on the model based solely on the visual modality (MMShot-V vs. MMShot-VA); (3) While noisy captions can even harm the performance of our models, our keyword extraction algorithm effectively retains useful information and filters out noise from the captions, further boosting the performance of MMShot (MMShot-VA vs. MMShot-VAL vs. MMShot-VAL (keywords)).

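| Modality | macro r@0.5 | macro p@0.5 | macro mAP | micro r@0.5 | micro p@0.5 | micro mAP |
| --- | --- | --- | --- | --- | --- | --- |
| Visual | 40.38 | 70.74 | 58.82 | 50.33 | 73.80 | 69.63 |
| Audio | 21.31 | 62.85 | 42.54 | 37.73 | 69.83 | 59.05 |
| Language | 14.33 | 30.45 | 23.63 | 29.81 | 51.38 | 39.85 |
| Language (keywords) | 11.16 | 63.48 | 29.17 | 19.98 | 59.07 | 42.54 |

Table 2: Impact of separate modalities.
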
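| Fusion Strategy | macro r@0.5 | macro p@0.5 | macro mAP | micro r@0.5 | micro p@0.5 | micro mAP |
| --- | --- | --- | --- | --- | --- | --- |
| Early Fusion | 43.34 | 72.62 | 61.38 | 54.36 | 75.63 | 72.73 |
| Intermediate Fusion | 42.06 | 74.01 | 61.57 | 54.26 | 75.93 | 72.67 |
| Late Fusion | 44.16 | 71.17 | 60.91 | 55.71 | 74.24 | 72.24 |

Table 3: Impact of Fusion Strategies.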

Ablation Study. Table 2 shows the performance of each modality in isolation on MovieNet genre classification. Overall, we find that the visual modality is the most effective at classifying genres by itself. However, the audio and language modalities also play a crucial role in genre prediction. We also note that adding the raw language features did not improve performance (MMShot-VAL vs. MMShot-VA in Table 1). In contrast, our keyword-based approach boosts performance, further validating the effectiveness of our keyword extraction algorithm.

Table 3 presents the effect of the fusion strategy. Specifically, we apply the three fusion strategies discussed in Section 3.3 to combine the visual and audio modalities on MovieNet. From the table, we observe that Intermediate Fusion and Early Fusion have comparable performance, and both outperform Late Fusion on mAP. Since Intermediate Fusion has a higher macro-mAP than Early Fusion, we adopt Intermediate Fusion in MMShot for the remaining experiments.

Long Movie Analysis. As discussed in Section 1, a challenge of movie genre classification is the analysis of long videos. Since MMShot considers the video shot as the basic unit, we apply MMShot to long videos in a sliding-window manner. Accordingly, MMShot is able to produce a sequence of labels from a long video input. These labels can be used not only for genre classification but also for shot retrieval.
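A sliding window over shot features can be sketched as follows. The window length and stride are assumptions chosen for illustration (the text does not report them), and `classify` is a hypothetical callable standing in for the trained genre classifier.

```python
from typing import Callable, List, Sequence, Tuple


def sliding_windows(num_shots: int, window: int = 8, stride: int = 4) -> List[Tuple[int, int]]:
    """Half-open [start, end) shot ranges covering the whole video."""
    if num_shots <= window:
        return [(0, num_shots)]
    starts = list(range(0, num_shots - window + 1, stride))
    if starts[-1] + window < num_shots:
        starts.append(num_shots - window)  # final window aligned to the video end
    return [(s, s + window) for s in starts]


def label_windows(
    shot_features: Sequence,
    classify: Callable[[Sequence], object],  # hypothetical genre classifier
    window: int = 8,
    stride: int = 4,
) -> List[Tuple[int, int, object]]:
    """Run the classifier on every window and keep the shot range with its output."""
    return [
        (start, end, classify(shot_features[start:end]))
        for start, end in sliding_windows(len(shot_features), window, stride)
    ]
```

Because each result keeps its shot range, the same output supports both uses mentioned above: aggregating labels for movie-level genres, or retrieving the shots that triggered a given genre.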

We analyze “Transformers: Revenge of the Fallen” as an example of long video analysis in Figure 3. As shown in the figure, MMShot not only returns the correct shots for the ground truth genres but also generalizes well to genres that do not belong to the ground truth. Specifically, the genres of “Transformers: Revenge of the Fallen” are Sci-Fi and Action, whose corresponding shots are presented in Figure 3(a) and Figure 3(b). Consistent with our expectations, shots that are classified as Sci-Fi consist of scenes like the universe, planets, or robot armies. Shots that are classified as Action contain common elements of action movies such as explosions and fast motion. We present two additional genres, Romance and War, in Figure 3(c) and Figure 3(d). MMShot successfully retrieves a series of shots related to these two genres. For example, shots that include weapons or soldiers are more likely to be selected for the War genre (Figure 3(d)), while shots that include daily life or couples are more likely to be selected for Romance (Figure 3(c)). The analysis of long movies can be applied to practical applications such as highlighting movie clips or automatic trailer generation. See the supplementary for more examples.

Figure 3: The analysis of movie “Transformers: Revenge of the Fallen”. See Section 4.2 for discussion

Low-level Visual Feature Analysis. We analyze the distribution of brightness and warm-cold color ratio on MovieNet to uncover the correlation between genres and low-level visual features. As illustrated in Figure 4, we observe that horror films have the lowest brightness value, which is in line with the common-sense view that horror films seek to make audiences feel scared and a dark environment serves this purpose. On the other hand, genres such as family and animation have high brightness distributions, which corresponds to the positive emotions, such as love and affection, that these kinds of movies try to express to audiences. For the cold-warm color ratio, western gets the lowest value while Sci-Fi achieves the highest value. Intuitively, western films usually have a sepia tone due to scenes with desert, blazing sun, and dirt, while Sci-Fi uses cold colors for scenes like the universe, spacecraft, and robot armies to express a sense of high-tech and sharpness.

Figure 4: Low-level visual feature analysis across movie genres. Left: brightness with confidence interval; Right: cold-warm color ratio with confidence interval. See Section 4.2 for discussion.
Figure 5: Representative sound events of Action, Family, Romance and Horror. See Section 4.2 for discussion.
Figure 6: Wordclouds of War, Music, Thriller and Romance. See Section 4.2 for discussion.

Sound Event Analysis. We analyze audio waveforms on MovieNet to uncover the correlations between genres and sound events. We present the representative sound events of 4 different movie genres in Figure 5. From the figure, we validate that the sound event (audio modality) is a discriminative attribute for recognizing genres. For example, the most frequent sound events of Romance include “Singing”, “Music for children”, and “jingle, tinkle”, which make people feel relaxed and happy. In contrast, Action movies are frequently associated with “Gunshot”, “Scary music”, and “fusillade”, which make people feel thrilled and excited. More examples are provided in the supplementary.

Keyword Analysis. We calculate the Term Frequency - Inverse Document Frequency (TF-IDF) to uncover the correlations between keywords and movie genres. Specifically, we create for each genre a table $T$ with dimension $m \times n$, where $m$ is the number of movies in this genre and $n$ is the vocabulary size. $T_{ij}$ represents the TF-IDF value of word $j$ in movie $i$, and $S_j$ is the score of word $j$ in the whole genre. We then plot the wordclouds for each genre. It is worth noting that some words such as “know”, “man”, and “think” rank high in most genres but do not carry real information. To address this issue, we design a mechanism where we first combine the top $N$ words of every genre into one list and count how often each word occurs in it. If a word in the list appears more than $M$ times, it is excluded from the wordcloud plots. Here we set $N$ to 20 and $M$ to 5. Figure 6 shows wordcloud plots of Romance, Thriller, Music, and War on the 28K trailers. We can observe that if words like “singer”, “applause”, or “blues” appear in a trailer, it has a higher tendency to be classified as Music. On the other hand, War movies are more related to words such as “soldier”, “country”, “home”, and “majesty”. More examples can be found in the supplementary.
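The filtering of uninformative words can be sketched as below. The input `genre_scores` is assumed to map each genre to per-word genre-level TF-IDF scores that have already been computed; the function name, the alphabetical tie-breaking, and the parameter names `top_n` and `max_count` are illustrative, with defaults mirroring the values 20 and 5 reported in the text.

```python
from collections import Counter
from typing import Dict, List


def filter_common_words(
    genre_scores: Dict[str, Dict[str, float]],
    top_n: int = 20,
    max_count: int = 5,
) -> Dict[str, List[str]]:
    """Keep each genre's top_n words, dropping words that are top-ranked in
    more than max_count genres."""
    top_words = {
        genre: [
            word
            for word, _ in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        ]
        for genre, scores in genre_scores.items()
    }
    # Count in how many genres' top lists each word appears.
    occurrence = Counter(word for words in top_words.values() for word in words)
    return {
        genre: [word for word in words if occurrence[word] <= max_count]
        for genre, words in top_words.items()
    }


# Toy scores: "know" ranks high everywhere and should be removed.
toy_scores = {
    "War": {"soldier": 0.9, "know": 0.8},
    "Music": {"singer": 0.9, "know": 0.7},
    "Romance": {"love": 0.9, "know": 0.6},
}
filtered = filter_common_words(toy_scores, top_n=2, max_count=2)
```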

4.3 Scene Boundary Detection

Datasets & Experiment Settings. The scene boundary detection task is evaluated on MovieNet where 318 movies are annotated with scene boundaries. Following the experiment setting of ShotCoL [10], we split the 318 movies into 190, 64, 64 movies for training, validation and test set respectively. Average Precision (AP) and Recall@0.5 are used for our evaluation metrics. The input is four sequential shots and the output is the probability that a scene boundary exists between the second and third shots. We use Binary Cross Entropy loss as the loss function and the weight for boundary versus non-boundary is 10:1 due to the data imbalance.
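The input construction and class weighting described above can be illustrated with a small sketch. It assumes a single boundary probability per four-shot window and a 10:1 weight on the positive (boundary) term; the helper names and the normalization by sample count are choices made for this example.

```python
import math
from typing import List, Sequence, Tuple


def boundary_windows(num_shots: int, window: int = 4) -> List[Tuple[int, int, int]]:
    """Enumerate (start, end, boundary) triples over consecutive shots.

    Each window spans shots [start, end); the candidate scene boundary lies
    just before shot `boundary`, i.e. between the window's 2nd and 3rd shots.
    """
    return [
        (start, start + window, start + window // 2)
        for start in range(num_shots - window + 1)
    ]


def weighted_bce(
    probs: Sequence[float],
    labels: Sequence[int],
    pos_weight: float = 10.0,
    eps: float = 1e-7,
) -> float:
    """Binary Cross Entropy with the boundary class weighted pos_weight : 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)
```

With this weighting, a missed boundary costs ten times as much as an equally confident false alarm, which counteracts the scarcity of boundary samples.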

Model Architecture. We adopt a three-layer perceptron classifier (number-of-shots × feature-dimension - 4096 - 1024 - 2) as our decoder, which is the same as in [10]. Consistent with the genre classification task, we sample 3 frames from each shot as the shot representation. However, unlike genre classification, the data for scene boundary detection only spans the visual modality and lacks the audio and language modalities. Therefore, we introduce additional pretrained features based solely on the visual modality to boost the performance. Specifically, we combine CLIP features with features pretrained on the Places [50] dataset, a large-scale dataset for the scene recognition task, as the input to our encoder.


Evaluation Results. Table 4 reports the quantitative results of MMShot and other baselines. The scores of the baselines are directly cited from [10]. From the table, we see that MMShot with CLIP features alone already outperforms every baseline except ShotCoL. The reason could be that ShotCoL leverages contrastive learning to learn a shot representation designed for shot similarity, while CLIP features are learned for image-text similarity. However, our model achieves new state-of-the-art results when CLIP features and Places features are combined.

| Models | AP | Recall@0.5 |
| --- | --- | --- |
| SCSA [8] | 14.7 | 54.9 |
| Story Graph [38] | 25.1 | 58.4 |
| Siamese [3] | 28.1 | 60.1 |
| ImageNet [12] | 41.26 | 30.06 |
| Places [50] | 43.23 | 59.34 |
| LGSS [27] | 47.1 | 73.6 |
| ShotCoL [10] | 53.37 | 81.33 |
| MMShot-CLIP (ours) | 51.96 | 78.62 |
| MMShot-CLIP-Places (ours) | 54.45 | 82.21 |

Table 4: Quantitative results of scene boundary detection on MovieNet. See Section 4.3 for discussion.
Figure 7: Pearson correlation coefficients between film genres on MovieNet. See Section 4.4 for discussion.

4.4 Limitations and Future Work

In this paper, we adopt the binary relevance strategy to train MMShot. This strategy is easy to implement but ignores the dependencies between labels. As shown in Figure 7, we observe that the movie genres do have dependencies with each other. For example, Thriller, Crime, and Horror are more likely to appear with each other. Family movies are labeled as Animation with a high probability. In contrast, negative Pearson correlation coefficients exist between genres like Comedy and Thriller, Drama and Documentary. Based on the observation, we conclude that effectively leveraging the correlations among different genres should be helpful for movie genre classification.
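The kind of genre dependency discussed above can be measured with a Pearson correlation between the binary label columns of two genres. The sketch below is a generic implementation on toy data, not the analysis script behind Figure 7; returning 0.0 for a constant column is a convention of this example.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention: undefined correlation reported as 0
    return cov / math.sqrt(var_x * var_y)


# Toy label columns: 1 if a movie carries the genre, 0 otherwise.
genre_a = [1, 1, 0, 0]
genre_b = [1, 1, 0, 0]  # always co-occurs with genre_a
genre_c = [0, 0, 1, 1]  # never co-occurs with genre_a
```

A coefficient near 1 indicates genres that tend to be assigned together, a coefficient near -1 indicates genres that rarely co-occur, and binary relevance training uses neither signal.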

Besides, we mainly investigate the effect of pretrained features from multi-modalities in our paper, which means the encoders to extract these features are frozen when training MMShot. As a result, the development of end-to-end finetuning strategies could be a potential direction to further improve our models.

5 Conclusion

In this paper, we proposed a multi-modal network based on shot information (MMShot) for movie genre classification, exploring the effect of the audio and language modalities, which are ignored by prior work. Since the audio is always available in our input videos, and our language information is recognized from audio waveforms, MMShot can boost performance without requiring additional datasets. In addition, we introduced a keyword extraction algorithm to effectively filter useful information from noisy captions, making the language modality beneficial for classifying genres. MMShot remarkably outperforms the state-of-the-art on genre classification, improving mAP by 17% to 21% on MovieNet and Condensed Movies. We further generalized MMShot to the scene boundary detection task, achieving a new state-of-the-art by improving AP by 1.1%. Extensive experiments were performed to demonstrate the long video analysis ability of MMShot and uncover the correlations between genres and movie elements from multiple modalities.


  • [1] M. Bain, A. Nagrani, A. Brown, and A. Zisserman (2020) Condensed movies: story based retrieval with contextual embeddings. In Proceedings of the Asian Conference on Computer Vision. Cited by: 3rd item, §1, §4.1.
  • [2] D. Bamman, B. O’Connor, and N. A. Smith (2013) Learning latent personas of film characters. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 352–361. Cited by: §2.
  • [3] L. Baraldi, C. Grana, and R. Cucchiara (2015) A deep siamese network for scene detection in broadcast videos. In Proceedings of the 23rd ACM international conference on Multimedia, pp. 1199–1202. Cited by: §2, Table 4.
  • [4] P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic (2013) Finding actors and actions in movies. In Proceedings of the IEEE international conference on computer vision, pp. 2280–2287. Cited by: §2.
  • [5] D. Brezeale and D. J. Cook (2006) Using closed captions and visual features to classify movies by genre. In Poster session of the seventh international workshop on Multimedia Data Mining (MDM/KDD2006), Cited by: §1.
  • [6] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §1, §3.1, §4.2, Table 1.
  • [7] P. Cascante-Bonilla, K. Sitaraman, M. Luo, and V. Ordonez (2019) Moviescope: large-scale analysis of movies using multiple modalities. arXiv preprint arXiv:1908.03180. Cited by: 1st item, §1, §2.
  • [8] V. T. Chasanis, A. C. Likas, and N. P. Galatsanos (2008) Scene detection in videos using shot clustering and sequence alignment. IEEE transactions on multimedia 11 (1), pp. 89–100. Cited by: Table 4.
  • [9] C. R. Chen, R. Panda, K. Ramakrishnan, R. Feris, J. Cohn, A. Oliva, and Q. Fan (2021) Deep analysis of cnn-based spatio-temporal representations for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6165–6175. Cited by: §2.
  • [10] S. Chen, X. Nie, D. Fan, D. Zhang, V. Bhat, and R. Hamid (2021) Shot contrastive self-supervised learning for scene boundary detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9796–9805. Cited by: §2, §4.3, §4.3, §4.3, Table 4.
  • [11] T. Cour, C. Jordan, E. Miltsakaki, and B. Taskar (2008) Movie/script: alignment and parsing of video and text transcription. In European Conference on Computer Vision, pp. 158–171. Cited by: §2.
  • [12] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: Table 4.
  • [13] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.2, §4.1.
  • [14] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter (2017) Audio set: an ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 776–780. Cited by: §3.1.
  • [15] M. Honnibal and I. Montani (2017) SpaCy 2: natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear 7 (1), pp. 411–420. Cited by: §3.2.
  • [16] Q. Huang, Y. Xiong, A. Rao, J. Wang, and D. Lin (2020) Movienet: a holistic dataset for movie understanding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pp. 709–727. Cited by: Figure 1, 3rd item, §1, §1, §1, §2, §2, §2, §3.1, §4.1, §4.1, §4.2, Table 1, footnote 3.
  • [17] Q. Huang, Y. Xiong, Y. Xiong, Y. Zhang, and D. Lin (2018) From trailers to storylines: an efficient way to learn from movies. arXiv preprint arXiv:1806.05341. Cited by: §1, §1, §1, §2, §2.
  • [18] J. Kim, M. Ma, K. Kim, S. Kim, and C. D. Yoo (2019) Progressive attention memory network for movie story question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8337–8346. Cited by: §2.
  • [19] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley (2020) Panns: large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing 28, pp. 2880–2894. Cited by: §3.1, §4.1.
  • [20] A. Kukleva, M. Tapaswi, and I. Laptev (2020) Learning interactions and relationships between movie characters. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9849–9858. Cited by: §2.
  • [21] J. Lei, L. Li, L. Zhou, Z. Gan, T. L. Berg, M. Bansal, and J. Liu (2021) Less is more: clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7331–7341. Cited by: §1, §2.
  • [22] Y. Meng, C. Lin, R. Panda, P. Sattigeri, L. Karlinsky, A. Oliva, K. Saenko, and R. Feris (2020) Ar-net: adaptive frame resolution for efficient action recognition. In European Conference on Computer Vision, pp. 86–104. Cited by: §2.
  • [23] S. Park, Y. Kim, M. N. Uddin, and G. Jo (2009) Character-net: character network analysis from video. In 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, Vol. 1, pp. 305–308. Cited by: §2.
  • [24] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid (2014) Category-specific video summarization. In European conference on computer vision, pp. 540–555. Cited by: §2.
  • [25] S. Protasov, A. M. Khan, K. Sozykin, and M. Ahmad (2018) Using deep features for video scene detection and annotation. Signal, Image and Video Processing 12 (5), pp. 991–999. Cited by: §2.
  • [26] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. Cited by: §3.1, §4.1.
  • [27] A. Rao, L. Xu, Y. Xiong, G. Xu, Q. Huang, B. Zhou, and D. Lin (2020) A local-to-global approach to multi-modal movie scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10146–10155. Cited by: §2, §2, Table 4.
  • [28] Z. Rasheed and M. Shah (2003) Scene detection in hollywood movies and tv shows. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2003. Proceedings., Vol. 2, pp. II–343. Cited by: §2.
  • [29] Z. Rasheed, Y. Sheikh, and M. Shah (2005) On the use of computable features for film classification. IEEE Transactions on Circuits and Systems for Video Technology 15 (1), pp. 52–64. Cited by: §1.
  • [30] D. Rotman, D. Porat, and G. Ashour (2017) Optimal sequential grouping for robust video scene detection using multiple modalities. International Journal of Semantic Computing 11 (02), pp. 193–208. Cited by: §2.
  • [31] Y. Rui, T. S. Huang, and S. Mehrotra (1998) Exploring video structure beyond the shots. In Proceedings. IEEE International Conference on Multimedia Computing and Systems (Cat. No. 98TB100241), pp. 237–240. Cited by: §2.
  • [32] A. Sadhu, T. Gupta, M. Yatskar, R. Nevatia, and A. Kembhavi (2021) Visual semantic role labeling for video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5589–5600. Cited by: §2.
  • [33] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta (2016) Hollywood in homes: crowdsourcing data collection for activity understanding. In European Conference on Computer Vision, pp. 510–526. Cited by: §2.
  • [34] Silero Team (2021) Silero Models: pre-trained enterprise-grade STT / TTS models and benchmarks. GitHub. Cited by: §1, §3.1, §3.2, §3, §4.1.
  • [35] G. S. Simões, J. Wehrmann, R. C. Barros, and D. D. Ruiz (2016) Movie genre classification with convolutional neural networks. In 2016 International Joint Conference on Neural Networks (IJCNN), pp. 259–266. Cited by: §1, §1, §2.
  • [36] T. Souček and J. Lokoč (2020) TransNet v2: an effective deep network architecture for fast shot transition detection. arXiv preprint arXiv:2008.04838. Cited by: §2, §4.1.
  • [37] T. Souček, J. Moravec, and J. Lokoč (2019) TransNet: a deep network for fast detection of common shot transitions. arXiv preprint arXiv:1906.03363. Cited by: §2.
  • [38] M. Tapaswi, M. Bauml, and R. Stiefelhagen (2014) Storygraphs: visualizing character interactions as a timeline. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 827–834. Cited by: Table 4.
  • [39] M. Tapaswi, M. Bauml, and R. Stiefelhagen (2015) Book2movie: aligning video scenes with book chapters. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1827–1835. Cited by: §2.
  • [40] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler (2016) Movieqa: understanding stories in movies through question-answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4631–4640. Cited by: §2.
  • [41] P. Thirard and L. Codelli (1994) Robert Sklar. Film, an international history of the medium, 1993; Kristin Thompson, David Bordwell. Film history, an introduction, 1994. 1895, revue d’histoire du cinéma 17 (1), pp. 170–170. Cited by: footnote 1.
  • [42] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §2, §3.1.
  • [43] B. T. Truong and S. Venkatesh (2007) Video abstraction: a systematic review and classification. ACM transactions on multimedia computing, communications, and applications (TOMM) 3 (1), pp. 3–es. Cited by: §2.
  • [44] B. Wang, Y. Xu, Y. Han, and R. Hong (2018) Movie question answering: remembering the textual cues for layered visual contents. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • [45] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European conference on computer vision, pp. 20–36. Cited by: §1, §2, §4.2, Table 1.
  • [46] C. Weng, W. Chu, and J. Wu (2009) Rolenet: movie analysis from the perspective of social networks. IEEE Transactions on Multimedia 11 (2), pp. 256–271. Cited by: §2.
  • [47] J. Xia, A. Rao, Q. Huang, L. Xu, J. Wen, and D. Lin (2020) Online multi-modal person search in videos. In European Conference on Computer Vision, pp. 174–190. Cited by: §2.
  • [48] M. Xu, Y. Xiong, H. Chen, X. Li, W. Xia, Z. Tu, and S. Soatto (2021) Long short-term transformer for online action detection. arXiv preprint arXiv:2107.03377. Cited by: §2.
  • [49] B. Zhou, A. Andonian, A. Oliva, and A. Torralba (2018) Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 803–818. Cited by: §1, §1, §2, §4.2, Table 1.
  • [50] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE transactions on pattern analysis and machine intelligence 40 (6), pp. 1452–1464. Cited by: §4.3, Table 4.
  • [51] H. Zhou, T. Hermans, A. V. Karandikar, and J. M. Rehg (2010) Movie genre classification via scene categorization. In Proceedings of the 18th ACM international conference on Multimedia, pp. 747–750. Cited by: §1, §1, §2.
  • [52] Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE international conference on computer vision, pp. 19–27. Cited by: §2.


Appendix A Implementation Details

A.1 Statistics of Source Videos

We present the number of source videos that we used from MovieNet and Condensed Movies in Table 5. The distribution of genres on both datasets is shown in Figure 8. From the figure, we see that the label distribution is remarkably imbalanced, validating the importance of the “micro” and “macro” metrics used in our experiments.

Datasets          Total   Training  Validation  Test   Type of Video
MovieNet          28,466  19,926    2,846       5,694  trailer
Condensed Movies  22,174  15,521    2,217       4,436  movie clip
Table 5: Statistics of source videos on MovieNet and Condensed Movies.
Figure 8: Distribution of genres on MovieNet (left) and Condensed Movies (right).
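
To illustrate why both averaging schemes matter under label imbalance, the following minimal sketch computes a macro-averaged and a micro-averaged average precision (AP) on toy data. The AP definition used here (mean of the precision values at the ranks of the positive samples), the function names, and the toy labels and scores are assumptions made for illustration; they are not our evaluation code.

```python
def average_precision(y_true, y_score):
    """Mean of precision values at the ranks of the positive samples."""
    order = sorted(range(len(y_score)), key=lambda i: -y_score[i])
    hits = 0
    precisions = []
    for rank, i in enumerate(order, start=1):
        if y_true[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else float("nan")


def macro_and_micro_ap(labels, scores):
    """labels, scores: one row per video, one column per genre."""
    n_genres = len(labels[0])
    per_genre = [
        average_precision([row[g] for row in labels], [row[g] for row in scores])
        for g in range(n_genres)
    ]
    # Macro: every genre contributes equally, regardless of frequency.
    macro = sum(per_genre) / n_genres
    # Micro: pool all (video, genre) pairs, so frequent genres dominate.
    flat_true = [v for row in labels for v in row]
    flat_score = [v for row in scores for v in row]
    micro = average_precision(flat_true, flat_score)
    return macro, micro, per_genre


# Toy, made-up data (assumption): genre 0 is frequent, genre 1 is rare.
LABELS = [[1, 0], [1, 0], [1, 1], [0, 0]]
SCORES = [[0.9, 0.6], [0.85, 0.3], [0.7, 0.2], [0.1, 0.4]]
MACRO, MICRO, PER_GENRE = macro_and_micro_ap(LABELS, SCORES)
print(f"macro={MACRO:.3f} micro={MICRO:.3f} per-genre={PER_GENRE}")
```

In this toy case the rare genre is ranked poorly, which lowers the macro score much more than the micro score. This is why reporting both is informative on imbalanced label distributions.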

Appendix B Additional Genre Classification Results

B.1 Per-genre performance

We present the per-genre performance of MMShot on MovieNet in Figure 9. We observe that though MMShot performs well on most genres, it is still difficult to correctly predict genres like Mystery, Biography, and History. The reason could be that these genres are mainly determined by higher-level semantics such as storylines that cannot be directly reflected in trailers or movie clips. Therefore, a potential improvement is developing an approach to learning higher-level semantics such as storylines, narrations, etc.

Figure 9: Per-genre performance (left: precision; right: recall) of MMShot on MovieNet.
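
As a rough illustration of how per-genre precision and recall of the kind plotted in Figure 9 can be obtained from multi-label predictions, consider the sketch below. The decision threshold of 0.5, the function name, and the toy labels and scores are assumptions for illustration only.

```python
def per_genre_precision_recall(labels, scores, genres, threshold=0.5):
    """Return {genre: (precision, recall)} from per-video genre scores."""
    results = {}
    for g, name in enumerate(genres):
        tp = fp = fn = 0
        for row_true, row_score in zip(labels, scores):
            predicted = row_score[g] >= threshold
            if predicted and row_true[g]:
                tp += 1
            elif predicted:
                fp += 1
            elif row_true[g]:
                fn += 1
        # NaN marks metrics that are undefined (no predictions or no positives).
        precision = tp / (tp + fp) if tp + fp else float("nan")
        recall = tp / (tp + fn) if tp + fn else float("nan")
        results[name] = (precision, recall)
    return results


# Toy, made-up data (assumption): one row per video, one column per genre.
GENRES = ["Action", "Mystery"]
LABELS = [[1, 0], [1, 1], [0, 1], [0, 0]]
SCORES = [[0.9, 0.2], [0.7, 0.4], [0.6, 0.8], [0.1, 0.3]]
RESULTS = per_genre_precision_recall(LABELS, SCORES, GENRES)
for genre, (precision, recall) in RESULTS.items():
    print(f"{genre}: precision={precision:.2f} recall={recall:.2f}")
```

Here the hypothetical Mystery genre is predicted precisely but with low recall, the pattern one would expect for genres whose cues are rarely visible in short clips.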

B.2 Sound Event Analysis

Figure 10 provides representative sound events of movie genres to supplement the results from the main paper, validating that the audio modality is a discriminative attribute for genre recognition.

B.3 Keyword Analysis

We plot additional wordclouds in Figure 11 as a supplement to the main paper, uncovering the correlations between keywords and movie genres.

B.4 Long Movie Analysis

We provide additional analysis of two long movies, Titanic (1997) and Jurassic Park (1993), in Figure 12 and Figure 13 to supplement the main paper, demonstrating the long video analysis ability of MMShot.

Figure 10: Sound events of different movie genres on MovieNet.
Figure 11: Wordclouds of different movie genres on MovieNet.
Figure 12: The analysis of movie “Titanic”.
Figure 13: The analysis of movie “Jurassic Park”.