Effectively leveraging Multi-modal Features for Movie Genre Classification

by Zhongping Zhang, et al.

Movie genre classification has been widely studied in recent years due to its various applications in video editing, summarization, and recommendation. Prior work has typically addressed this task by predicting genres based solely on visual content. As a result, these methods often perform poorly on genres such as documentary or musical, where non-visual modalities like audio or language play an important role in correct classification. In addition, analyzing long videos at the frame level incurs high computational cost and makes prediction less efficient. To address these two issues, we propose a Multi-Modal approach leveraging shot information, MMShot, to classify video genres in an efficient and effective way. We evaluate our method on MovieNet and Condensed Movies for genre classification, achieving a 17% improvement (mAP) over the state-of-the-art. Extensive experiments demonstrate the ability of MMShot to analyze long videos and uncover correlations between genres and multiple movie elements. We also demonstrate our approach's ability to generalize by evaluating it on the scene boundary detection task, where it improves over the state-of-the-art by 1.1%.
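The abstract does not spell out MMShot's architecture, but the core idea it describes, pooling features per shot rather than per frame and fusing multiple modalities, can be illustrated with a minimal sketch. Everything below (dimensions, average pooling, late fusion by concatenation, a linear multi-label head) is an assumption for illustration, not MMShot's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, not taken from the paper.
N_SHOTS, D_VIS, D_AUD, D_TXT, N_GENRES = 8, 16, 8, 12, 4

def pool_shots(shot_feats):
    """Collapse per-shot features (N_SHOTS x D) into one clip-level vector."""
    return shot_feats.mean(axis=0)

def late_fusion(visual, audio, text):
    """Concatenate pooled modality vectors into a single fused representation."""
    return np.concatenate([pool_shots(visual), pool_shots(audio), pool_shots(text)])

def predict_genres(fused, W, b):
    """Multi-label genre scores via independent sigmoids (a movie can have several genres)."""
    return 1.0 / (1.0 + np.exp(-(W @ fused + b)))

# One clip's per-shot features for each modality (random stand-ins).
visual = rng.standard_normal((N_SHOTS, D_VIS))
audio = rng.standard_normal((N_SHOTS, D_AUD))
text = rng.standard_normal((N_SHOTS, D_TXT))

fused = late_fusion(visual, audio, text)          # shape: (D_VIS + D_AUD + D_TXT,)
W = rng.standard_normal((N_GENRES, fused.size))   # untrained classifier weights
b = np.zeros(N_GENRES)
scores = predict_genres(fused, W, b)              # one score per genre in [0, 1]
```

Operating on a handful of shot-level vectors instead of thousands of frame-level ones is what makes this kind of pipeline cheap for long videos; the modality fusion is what lets audio or dialogue cues rescue genres that visuals alone miss.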



