Rethinking movie genre classification with fine-grained semantic clustering

by   Edward Fish, et al.
University of Surrey

Movie genre classification is an active research area in machine learning. However, due to the limited labels available, there can be large semantic variations between movies within a single genre definition. We expand these 'coarse' genre labels by identifying 'fine-grained' semantic information within the multi-modal content of movies. By leveraging pre-trained 'expert' networks, we learn the influence of different combinations of modes for multi-label genre classification. Using a contrastive loss, we continue to fine-tune this 'coarse' genre classification network to identify high-level intertextual similarities between the movies across all genre labels. This leads to a more 'fine-grained' and detailed clustering, based on semantic similarities while still retaining some genre information. Our approach is demonstrated on a newly introduced multi-modal 37,866,450 frame, 8,800 movie trailer dataset, MMX-Trailer-20, which includes pre-computed audio, location, motion, and image embeddings.



1 Introduction

Genres are a useful classification device for condensing the content of a movie into an easy-to-understand contextual frame for the viewer. However, in the field of Film Theory, genre is not perceived as a reliable descriptive label for a number of reasons. For example, Neale [Neale2012] states that genre labels are not extensive enough to cover the diversity of content within a movie and may only be relevant for the period in which they are used. Altman [Altman1984] argues that this is because genres are in a constant process of negotiation and change, where a stable set of semantic givens is developed both through syntactical experimentation and in response to audience or cultural development. We can find thousands of films with identical genre labels but very different inter-textual, semantic content. With this in mind, we propose that genre labels should be considered a weak labelling methodology. To address this classification issue, we present a self-supervised approach to finding shared inter-textual information between movies that exists outside of these restrictive genre categories.

Recent machine learning based genre classification studies have under-explored the semantic variation that exists between genre labels [alvarez2019influence, wehrmann2017movie]. Furthermore, efforts have been made to avoid the issues that come with this poorly defined classification problem. In [alvarez2019influence], the authors show how using a broader range of distribution dates within the movie dataset results in inferior classification when predicted using low-level visual features. We also find that the movie genre classification dataset LMTD-9 [wehrmann2017movie] only features movies from before 1980, which may be in response to the more fluid nature of genre representation in the last thirty years. Lastly, we find that many genre classification papers have avoided multi-label approaches [rasheed2005use, huang2012movie, Zhao2016], simplifying the complex relationships that exist between multiple genres [Luklow1984].

Therefore we approach genre classification as a weakly labelled problem, seeking to find similarities between the inter-textual content of movies within the genre space. To do so, we exploit expert knowledge in the form of semantic embeddings (‘experts’) as proposed in [liu2019CollabGate], obtaining several data modes including scene understanding, image content analysis, motion style detection and audio. Using a contextually gated approach [Miech2017] enables us to amplify modes that are more useful for multi-label genre classification, and yields good results for discrete genre labelling. Then, inspired by [Luklow1984, Neale2012, Altman1984], we continue to train the model in a self-supervised manner, uniquely leveraging the similarities and differences of sub-sequences from within the trailers to identify inter-textual similarities between the movies for clustering, retrieval, and genre label improvement. This has the effect of expanding genre clusters by their semantic information, leading to improved clustering and retrieval.

Figure 1: Self-supervised genre clustering via collaborative experts. (a) is a T-SNE plot showing the output of the coarse genre encoder network. Here, trailers that share the same three genres, ‘Biography’, ‘Drama’, and ‘History’, have a high cosine similarity and are well clustered, as is the ‘Documentary’ genre. (b) illustrates the output of our fine grained genre model, where the model has separated the trailers taking into consideration their multi-modal content. In this example, the movie ‘Darkest Hour’ is pushed further away from ‘Cleopatra’ and ‘Braveheart’, as the latter two share more semantic content with other large scale historical action movies. In the second example, the three ‘Documentary’ trailers are pushed apart with consideration to their ‘Music’ and ‘Adventure’ inter-textual signatures.

As in other works [rasheed2005use, huang2012movie, Zhao2016, wehrmann2017movie, shambharkar3DConv], we use movie trailers as a condensed representation of the content of a movie. Our work is demonstrated on a new large 37 million frame multi-label genre dataset with pre-processed expert embeddings, which will be made available to support research in this field. Our proposed work makes the following contributions:

  • (i) We introduce MMX-Trailer-20, a new dataset of 8803 movie trailers (37,866,450 frames), spanning 120 years of global cinema, and labelled with up to 6 genre labels and 4 pre-computed embeddings per trailer;

  • (ii) We demonstrate the effectiveness of a multi-modal, collaboratively gated network for multi-label coarse genre classification of up to 20 genres;

  • (iii) We enable fine grained semantic clustering of genres via self-supervised learning for retrieval and exploration;

  • (iv) We extensively assess the performance of the representation on a wide range of genres and trailers.

2 Related Work

Coarse Movie Genre Classification: Earlier techniques in this field extract low-level audiovisual descriptors. Huang (H.Y.) et al. [huang2007film] used two features, scene transitions and lighting. In contrast, Jain & Jadon [jain2009movies] applied a simple neural network with low-level image and audio features. Huang (Y.F.) & Wang used the SAHS (Self Adaptive Harmony Search) algorithm to select features for different movie genres, learnt using a Support Vector Machine, with good results. Zhou et al. [zhou2010movie] predicted up to four genres with a bag-of-visual-words (BOVW) clustering technique. Musical scores have also been shown to offer a useful mode for classification, as in the work of Austin et al. [austin2010characterization], who predicted genre with spectral analysis using SVMs. More recent work has utilised deep learning and convolutional neural networks for genre classification. Wehrmann & Barros [wehrmann2017movie, wehrmann_2016_deep] used convolutions to learn the spatial as well as temporal characteristic-based relationships of the entire movie trailer, studying both audio and video features. Shambharkar et al. [shambharkarmultimodal] introduced a new video feature as well as three new audio features which proved useful in classifying genre, combining a CNN with audio features to provide promising results, while [shambharkar3DConv] employed 3D ConvNets to capture both the spatial and the temporal information present in the trailer. The ‘interestingness’ of movies has also been predicted by audiovisual features [AhmedInterestingness2018]. Additional features, including text and other metadata, have been combined using simple pooling in more recent work by Bonilla [cascante2019moviescope] to analyse the complementary nature of different modalities.

Figure 2: An overview of the approach. Clips are extracted from each trailer, and each sequence consists of 9 clips, constructed from the concatenation of each output from the Collaborative Gating Units. The video sequences are concatenated and passed through the bottleneck MLP, which generates the feature embedding vector. Further training for classification encourages this embedding to capture coarse genre information. After training, the whole network is fine-tuned using the self-supervised approach described in Sec. 3.3 to encourage the embedding to highlight similar inter-textual information between samples for fine-grained clustering. Dots here represent broad genres such as Action, Adventure and Sci-Fi. The fine-grained network separates the individual genres, drawing similar films together while retaining some of the broader genre information.

Supervised Activity classification: Given that movies are concatenations of image frames, the field of activity recognition and classification is also relevant. Early seminal works used either 3D (spatial and temporal) convolutions  [halevy2009unreasonable] or 2D convolutions (spatial) [he2016identity], both utilising a mono-mode of appearance information from RGB frames. Simonyan & Zisserman [radford2018improving] address the lack of motion features captured by these architectures, proposing two-stream late fusion that learnt distinct features from the Optical Flow and RGB modalities, outperforming single modality approaches. Later architectures have focused on modelling longer temporal structure, through the consensus of predictions over time [kiros2014unifying, venugopalan2015sequence, yamamoto2003topic] as well as inflating CNNs to 3D convolutions [carreira2017quo_kinetics], all using the two-stream approach of late fusion of RGB and Flow. Given the temporal nature of video, the computation can be high, therefore recently the focus has been centered on reducing the high computational cost of 3D convolutions [mithun2018learning, hauptmann2002multi, xu2016msr], yet still showing improvements when reporting results of two-stream fusion [xu2016msr].

Self Supervision for Activity Recognition: Given the amount of data video presents, labelling is a challenging and time-consuming activity. Therefore self-supervised methods of learning have been developed. Learning representations from the temporal [fernando2017self, wei2018learning, miech2017learning] and multi-modal structure of video [arandjelovic2017look, korbar2018cooperative], are examples of self-supervised learning, leveraging pre-training on a large corpus of unlabelled videos. Methods exploiting the temporal consistency of video have predicted the order of a sequence of frames [fernando2017self] or the arrow of time [wei2018learning]. Alternatively, the correspondence between multiple modalities has been exploited for self-supervision, particularly with audio and RGB [arandjelovic2017look, korbar2018cooperative, owens2018audio].

3 Methodology

This section outlines our proposed methodology for both coarse classification and finer grained clustering and retrieval. In Fig. 2, we present an overview of our approach. Using four pre-trained multi-modal ‘experts’, we extract audio and visual features from the input video. To enable genre classification, a collaborative gating model [liu2019CollabGate, Miech2017] learns to emphasise or downplay combinations of these features to minimise a multi-label loss. This training has the effect of learning the most valuable combination of modes for each label. After we achieve high accuracy for multi-label classification, we encourage the network to develop fine grained semantic clusters through self-supervised training. To achieve this, inspired by the approach of [chen2020simpleSIMCLR], we maximise the cosine similarity between the embedding vectors of sub-sequences obtained from the same movie trailer (positive examples) while pushing negative sequence pairs further apart in the feature space.

Given a set of videos $v$, each video is made up of a collection of sequences $s$, so $v = \{s_1, \dots, s_n\}$, where there are $n$ sequences in a video. Each sequence is formed of $k$ clips, giving $s = \{c_1, \dots, c_k\}$. Ideally, the feature embeddings for all clips of a video should lie close together as they will have the same class labels, while those from other videos with different labels should lie far apart. The aim of this work is to create a function $f$ that can map a clip $c$ from a video sequence $s$, where $c \in s$, to a joint feature space that respects the difference between clips. To construct our function $f$, we rely on several pre-trained single modality experts, $\{\Psi_1, \dots, \Psi_E\}$, with $E$ experts, where $\Psi_i$ is the $i$'th expert. These operate on the video or audio data and will each project the clip to an individual variable length embedding. Given that the embeddings are of a variable length, we aggregate the embeddings along their temporal component to form a standard vector size. Any temporal aggregation could be used here, but we use average pooling for the video based features. For audio, we implement NetVlad [Arandjelovic2018], inspired by the vector of locally aggregated descriptors commonly used in image retrieval. To enable their combination in the following collaborative gating phase, we apply linear projections to transform these task-specific embeddings to a standard dimensionality.
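
The aggregation and projection steps for the video based experts can be illustrated with a minimal sketch. It assumes frame-level features stored as plain Python lists; the function names, the toy feature values and the projection weights are hypothetical stand-ins for a learnt linear layer, and NetVlad aggregation for audio is not shown.

```python
from typing import List

Vector = List[float]


def mean_pool(frames: List[Vector]) -> Vector:
    """Average a variable-length list of frame-level features over time."""
    if not frames:
        raise ValueError("a clip needs at least one frame-level feature")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def linear_project(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Apply y = W x + b, where `weight` holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


# Toy clip with three frames of 4-d features (hypothetical numbers).
frames = [[1.0, 2.0, 3.0, 4.0], [3.0, 2.0, 1.0, 0.0], [2.0, 2.0, 2.0, 2.0]]
pooled = mean_pool(frames)  # one 4-d vector, whatever the clip length

# Hypothetical 2x4 projection standing in for a learnt linear layer.
W = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
b = [0.0, 0.5]
projected = linear_project(pooled, W, b)  # standard 2-d output
```

Because pooling removes the temporal axis before projection, clips of any length end up with vectors of the same size, which is what the gating phase requires.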

3.1 Collaborative Gating Unit

The Collaborative Gating Unit as proposed in  [Miech2017] aims to achieve robustness to noise in the features through two mechanisms: (i) the use of information from a wide range of modalities; (ii) a module that aims to combine these modalities in a manner that is robust to noise.

There is a two-stage process to learn the optimum combination of the expert embeddings for noise robustness: define a single attention vector for the $i$'th expert; then modulate the expert responses with the original data.

To create the $i$'th expert's projection $T_i$, we use the approach first proposed by [santoro2017simple] for answering visual questions. The attention vector of an expert projection will consider the potential relationships between all pairs associated with this expert, as defined in eq. 1.

$$T_i(v) = h_\phi\Big(\sum_{j \neq i} g_\theta\big(\Psi_i(v), \Psi_j(v)\big)\Big) \qquad (1)$$

This creates the projection between expert $i$ and $j$, where $\Psi_i(v)$ denotes the embedding of the $i$'th expert, $g_\theta$ is used to infer the pairwise task relationships while $h_\phi$ maps the sum of all pairwise relationships into a single attention vector $T_i$, and $v$ is the set of sequences. Both $g_\theta$ and $h_\phi$ are defined as multi-layer perceptrons (MLPs). To modulate the result, we take the attention vectors $T_i$ and perform element-wise multiplication (Hadamard product) with the initial expert embedding vector $\Psi_i(v)$, which results in a suppressed or amplified version of the original expert embedding. Each expert embedding is then passed through a Gated Embedding Module (GEM) [miech2018learning] before being concatenated together into a single fixed length vector for the clip. We capture 9 clip embeddings before concatenating and passing through an MLP to obtain a sequence embedding. These sequence representations are then concatenated together before being passed through a bottleneck layer which learns a compact embedding for the whole trailer.
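
The two-stage gating described above can be sketched as follows. This is an illustration under stated assumptions rather than the model's implementation: the pairwise function `g` and the aggregation function `h` are passed in as plain callables standing in for the two MLPs, and the toy choices used in the example (element-wise product and identity) are hypothetical.

```python
from typing import Callable, List

Vector = List[float]


def collaborative_gate(
    experts: List[Vector],
    g: Callable[[Vector, Vector], Vector],
    h: Callable[[Vector], Vector],
) -> List[Vector]:
    """Modulate each expert embedding with an attention vector built from
    its pairwise relationships with every other expert."""
    gated = []
    for i, psi_i in enumerate(experts):
        # Stage 1: sum g(psi_i, psi_j) over all other experts j != i,
        # then map the sum to a single attention vector with h.
        total = [0.0] * len(psi_i)
        for j, psi_j in enumerate(experts):
            if j != i:
                total = [t + r for t, r in zip(total, g(psi_i, psi_j))]
        attention = h(total)
        # Stage 2: Hadamard product with the original expert embedding.
        gated.append([x * a for x, a in zip(psi_i, attention)])
    return gated


# Hypothetical stand-ins for the two MLPs.
def toy_g(a: Vector, b: Vector) -> Vector:
    return [x * y for x, y in zip(a, b)]


def toy_h(x: Vector) -> Vector:
    return list(x)


experts = [[1.0, 2.0], [0.5, 0.5], [2.0, 0.0]]
gated = collaborative_gate(experts, toy_g, toy_h)
```

Dimensions where the other experts agree with an expert are amplified, while dimensions with no supporting signal are suppressed towards zero.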


3.2 Coarse Grained Genre Classification

The trailer embedding obtained from the collaborative gating unit can be trained in conjunction with genre labels to enable classification. Given each trailer can have up to six genre labels, a Binary Cross Entropy Logits Loss is minimised. First, the sequence embeddings are summed over a trailer and then projected via an MLP to produce a logits embedding. We then proceed to minimise the loss until convergence. This loss combines a Sigmoid layer and the Binary Cross Entropy Loss, as this is more numerically stable than using a Sigmoid followed by a Binary Cross Entropy Loss.

$$\ell_c(x, y) = -\frac{1}{N}\sum_{n=1}^{N}\Big[p_c\, y_{n,c} \log \sigma(x_{n,c}) + (1 - y_{n,c}) \log\big(1 - \sigma(x_{n,c})\big)\Big]$$

where $c$ is our genre class label, $N$ is the number of samples in the batch, $p_c$ is the weight of the positive answer for the class $c$, $\sigma$ is the Sigmoid function, $x_{n,c}$ is the logits and $y_{n,c}$ is the target. With this method it is also possible to perform genre classification on each sequence, by adjusting the network so that the sequence embedding becomes the logits embedding. While the gated encoder accuracy is degraded slightly by this technique (as outlined later in the ablation studies, see Tbl. 2), it is possible to identify different sub genres at a sequence level. For example, one could identify specific ‘Adventure’ sequences in a movie that only has the genre label of ‘Action’.
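
To illustrate why fusing the Sigmoid into the loss is numerically safer, the following is a minimal standard-library sketch of a multi-label binary cross entropy computed directly from logits. The function names and the optional `pos_weight` argument are illustrative assumptions; in practice a library routine such as PyTorch's `BCEWithLogitsLoss` would be used.

```python
import math
from typing import List, Optional


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bce_with_logits(
    logits: List[List[float]],
    targets: List[List[float]],
    pos_weight: Optional[List[float]] = None,
) -> float:
    """Mean multi-label BCE over a batch of shape [samples][classes]."""
    n_classes = len(logits[0])
    if pos_weight is None:
        pos_weight = [1.0] * n_classes
    total = 0.0
    for x_row, y_row in zip(logits, targets):
        for x, y, p in zip(x_row, y_row, pos_weight):
            # -log(sigmoid(x)) == softplus(-x)
            # -log(1 - sigmoid(x)) == softplus(x)
            total += p * y * softplus(-x) + (1.0 - y) * softplus(x)
    return total / (len(logits) * n_classes)


# A logit of 0 is maximally uncertain, so each term costs log(2).
uncertain = bce_with_logits([[0.0, 0.0]], [[1.0, 0.0]])

# Extreme logits do not overflow, unlike log(sigmoid(x)) computed naively.
confident_right = bce_with_logits([[1000.0]], [[1.0]])
confident_wrong = bce_with_logits([[1000.0]], [[0.0]])
```

The loss never evaluates `exp` on a large positive argument, so very confident predictions neither overflow nor produce `log(0)`.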

We show in the results section later how collaborative gating is effective at improving the prediction task of user-defined labels in a fully supervised manner. By analysing data in the bottleneck vector, we can see how the model is able to capture important semantic information for genre clustering. However, we can also see that movies with similar genre labels are grouped, even if the style or content of the movie varies. To create a more granular representation of the genre space, we continue to allow the model to train self-supervised.

3.3 Fine Grained Semantic Genre Clustering

As discussed in the introduction, discrete genre labels are restrictive and only offer a broad representation of the content of a video. We aim to find finer grained semantic content by identifying similarities in the sound, locations, objects, and motion within the videos. To achieve this, we extend the pre-trained coarse genre classification model with a self-supervised contrastive learning strategy using a normalised temperature-scaled cross-entropy loss (NT-Xent) as proposed by Chen et al. [chen2020simpleSIMCLR]. In [chen2020simpleSIMCLR], image augmentations are used as comparative features to fine-tune the embedding layer of their classification network. The goal is to encourage greater cosine similarity between embeddings obtained from the same image, while forcing the negative pairs apart. We uniquely extend this method to video by splitting each movie trailer into two equal-length sets of sequences, and using the embeddings of these sequences as the representation pair $(z_i, z_j)$.

$$\ell(i, j) = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_j)/\tau\big)}{\sum_{k \neq i} \exp\big(\mathrm{sim}(z_i, z_k)/\tau\big)}$$

Here $z_i$, $z_j$ and $z_k$ are the feature representations produced by $m(\cdot)$, a projection head encoder formed from MLPs, $\tau$ is a temperature parameter set at 0.5 and $\mathrm{sim}$ is the cosine similarity metric. $z_i$ and $z_j$ are two embedding vectors obtained from the same video as described above, while $z_k$ (for $k \neq j$) is an embedding vector from another video. Here, the loss will enforce $z_i$ closer in cosine similarity to $z_j$ but further from $z_k$. This process is illustrated in the overview Fig. 2.

Pairing each embedding vector from a video with all embedding vectors from other videos maximises the number of negatives. As a result, for each video, we get $2(N-1)$ negative pairs, where $N$ is the number of videos considered. Therefore in training we mini-batch $B$ trailers, which comprises $2B$ sequence embeddings $\{z_1, \dots, z_{2B}\}$. The overall contrastive loss is computed as shown in the equation below

$$\mathcal{L} = \frac{1}{2B} \sum_{k=1}^{B} \big[\ell(2k-1, 2k) + \ell(2k, 2k-1)\big]$$
After training, the MLP projection head $m(\cdot)$ is removed and we use the bottleneck layer of the collaborative gating model as a pre-trained embedding projection network. The fine-tuning using a contrastive loss encourages the bottleneck layer to retain some coarse genre information while finding similar inter-textual elements in other trailers. This leads to a more diverse clustering of samples, identifying sub-label information within the original label clusters.
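
As an illustration of the contrastive objective, the sketch below computes an NT-Xent style loss over a small batch using only the standard library. It assumes the embeddings are ordered so that consecutive pairs (indices 0 and 1, 2 and 3, and so on) come from the two halves of the same trailer; this layout, the function names and the toy vectors are assumptions made for the example.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nt_xent(z: List[Sequence[float]], temperature: float = 0.5) -> float:
    """NT-Xent over 2B embeddings; z[2k] and z[2k + 1] form a positive pair."""
    n = len(z)
    if n < 2 or n % 2:
        raise ValueError("expected an even number of embeddings")
    total = 0.0
    for i in range(n):
        j = i ^ 1  # index of the other half of the same trailer
        positive = cosine_similarity(z[i], z[j]) / temperature
        # Denominator covers every embedding except z[i] itself.
        logits = [
            cosine_similarity(z[i], z[k]) / temperature
            for k in range(n)
            if k != i
        ]
        log_denominator = math.log(sum(math.exp(v) for v in logits))
        total += log_denominator - positive
    return total / n


# Halves of each trailer agree with each other: low loss.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
# Halves match the other trailer instead: high loss.
mismatched = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

loss_aligned = nt_xent(aligned)
loss_mismatched = nt_xent(mismatched)
```

The loss falls as the two halves of a trailer become more similar to each other than to the halves of other trailers in the batch.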

4 Results and Discussion

This section introduces the MMX-Trailer-20 dataset in Sec. 4.1, followed by specific architecture and implementation details in Sec. 4.2. Key results for coarse genre classification are presented in Sec. 4.3 with an ablation study of the method’s components and comparison against other baselines. Fine grained semantic clustering of the dataset is then presented in Sec. 4.4 (on self-supervised retrieval).

4.1 MMX-Trailer-20: Multi-Model eXperts Trailer Dataset

There are several datasets upon which previous works test. However, capturing scale and variability in a dataset is challenging, especially in terms of diversity of genre, size of dataset and year of distribution. Tbl. 1 compares the size and labelling of datasets used in recent works in genre classification.

| Dataset | Video Source | Number Trailers | Frames | Label Source | Num. Genres | Genre/Trailer |
|---|---|---|---|---|---|---|
| Rasheed [rasheed2005use] | Apple | 101 | - | - | 4 | 1 |
| Huang [huang2012movie] | Apple | 223 | - | IMDb | 7 | 1 |
| Zhou [Zhao2016] | IMDb+Apple | 1239 | 4.5M | IMDb | 4 | 3 |
| LMTD-9 [wehrmann2017movie] | Apple | 4000 | 12M | IMDb | 9 | 3 |
| Moviescope [cascante2019moviescope] | IMDb | 5000 | 20M | IMDb | 13 | 3 |
| MMX-Trailer-20 | Apple+YT | 8803 | 37M | IMDb | 20 | 6 |

Table 1: Details of movie genre datasets, including the proposed MMX-Trailer-20.

As shown, most datasets are small with limited numbers of genre labels, both in terms of variability and the number assigned to a single trailer. Moviescope [cascante2019moviescope] is the closest to the proposed dataset, with up to 3 genre labels per trailer and 5000 trailers. However, we increase the number of trailers and labels per trailer, while substantially increasing the number of frames available (37M, compared with 20M for Moviescope in Tbl. 1). Furthermore, we will make available all pre-computed expert embeddings for every scene. The collection totals 8803 movie trailers drawn from Apple Trailers and YouTube, comprising 37,866,450 individual frames of video. The statistics of the dataset can be seen in Fig. 4: a wide range of genres exists, each trailer is labelled with at least 3 genres on average, and the year of distribution ranges from the 1930s to the present day.

Figure 3: Examples of the diversity of trailers within the MMX-Trailer-20 dataset.

Figure 4: MMX-Trailer-20 dataset statistics (best zoomed in): (a) frequency of genre in dataset; (b) number of labels per trailer; (c) scene length; (d) year of distribution.

Every trailer is a compact encapsulation of the full movie in a short 2 to 3 minute video, and we can collect a weak proxy of genre classification by matching the trailer to its user generated entry on the website IMDb. On IMDb, users can select up to six genre labels for each trailer. The dataset has 20 genres: Action, Adventure, Animation, Comedy, Crime, Documentary, Drama, Family, Fantasy, History, Horror, Music, Mystery, Science-Fiction, Western, Sport, Short, Biography, Thriller and War. Qualitative examples illustrating the variety of the dataset are shown in Fig. 3.

Processing of the Dataset: The dataset is pre-processed with scene detection performed using PySceneDetect [scenedetectionoverview], extracting individual clips from each trailer. We remove the first and last frames to mitigate poor scene detection. Audio is extracted as a mono 16-bit WAV file at 16 kHz using FFmpeg. To compute motion frames, we use Dual Optical Flow as introduced in [zach2007duality] and outlined in the implementation of [TVL1Implementation], before passing the optical flow images through the motion expert encoder. Extracted frames are also passed to the scene and appearance encoders. We partition the dataset into 6047 trailers for training, 754 for validation and 754 for testing, totalling 7555 trailers. The number of trailers used for evaluation is 1248 fewer than the full dataset, as we exclude trailers that have fewer than ten clips. This is done so we can maintain constant batch sizes at a minimum sequence size.
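
The exclusion rule and the split sizes quoted above can be checked with a short sketch. The helper name and the toy clip counts are hypothetical; only the split sizes, the dataset total and the ten-clip threshold come from the text.

```python
from typing import Dict, List


def usable_trailers(clip_counts: Dict[str, int], min_clips: int = 10) -> List[str]:
    """Keep only trailers with at least `min_clips` detected clips."""
    return sorted(name for name, n in clip_counts.items() if n >= min_clips)


# Hypothetical clip counts per trailer after scene detection.
example_counts = {"trailer_a": 12, "trailer_b": 9, "trailer_c": 10}
kept = usable_trailers(example_counts)

# Split sizes and dataset total as reported in the text.
split_sizes = {"train": 6047, "val": 754, "test": 754}
total_used = sum(split_sizes.values())
excluded = 8803 - total_used
```

The arithmetic confirms that the three partitions account for 7555 trailers, leaving 1248 of the 8803 collected trailers excluded.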

4.2 Implementation Details

Expert Features: To capture the rich content of the trailer, we draw on several existing powerful representations of the modalities present in movie trailers: Appearance, Audio, Scene and Motion. The Appearance feature is extracted using an SENet154 model [hu2018squeeze], pre-trained on ImageNet [imagenet_cvpr09] for the task of image classification. The Scene feature is computed on a per-frame basis from a ResNet-18 model [he2016deep] pre-trained on the Places365 [zhou2017places] dataset. The Motion of the clip is encoded via the I3D inception model [carreira2017quo_kinetics] and a 34-layer R(2+1)D model [tran2018closer] trained on the Kinetics-600 dataset [carreira2017quo_kinetics]. The Audio embeddings are obtained with a VGG style model, trained for audio classification on the YouTube-8m dataset [Abu-El-Haija2016]. Each expert returns an embedding of its own dimensionality. To aggregate the features extracted on a frame-wise basis, for appearance, scene and motion embeddings, we average frame-level features along the temporal dimension to produce a single feature vector per clip per feature. For audio, we aggregate the features using a vector of locally aggregated descriptors as outlined in [arandjelovic2017look, Arandjelovic2018]. We then average each expert feature to the same dimension before passing it to the collaborative gating unit described above.

Training details: We implement our model using the PyTorch library, and hyper-parameters were identified using a coarse-to-fine grid search. For supervised coarse genre classification, the Binary Cross Entropy loss is minimised over 200 epochs using the Adam optimiser [Kingma2015] with AMSGrad [Reddi2018], an initial learning rate of 0.00003, and a batch size of 32 samples. In the self-supervised training, we adjust the learning rate to 0.0001. We pass ten epochs of the samples through the encoder before reducing the learning rate using cosine annealing as proposed in [chen2020simpleSIMCLR]. We continue to fine-tune the network for 50 epochs. Once the semantic encoder has been fine-tuned, we remove the projection head network and then use the output of the bottleneck layer at run time.
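
One way to read the self-supervised schedule is sketched below: the learning rate is held at 0.0001 for the first ten epochs and then follows a cosine curve. The text does not state whether the 50 epochs include the initial ten or what the final learning rate is, so the `total_epochs=50` and `min_lr=0.0` defaults are assumptions, as is the function name.

```python
import math


def learning_rate(
    epoch: int,
    base_lr: float = 1e-4,
    hold_epochs: int = 10,
    total_epochs: int = 50,
    min_lr: float = 0.0,
) -> float:
    """Constant rate for `hold_epochs`, then cosine annealing to `min_lr`."""
    if epoch < hold_epochs:
        return base_lr
    # Fraction of the annealing phase completed, clamped to [0, 1].
    progress = min((epoch - hold_epochs) / (total_epochs - hold_epochs), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [learning_rate(e) for e in range(51)]
```

In PyTorch the annealing phase would more typically be handled by a built-in scheduler such as `torch.optim.lr_scheduler.CosineAnnealingLR`; the closed form above only makes the shape of the curve explicit.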

| Model | Actn | Advnt | Animtn | Bio | Cmdy | Crme | Doc | Drma | Famly | Fntsy | Hstry | Hrror | Mystry | Music | SciFi | Wstrn | Sprt | Shrt | Thrll | War | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Support | 130 | 197 | 46 | 13 | 224 | 102 | 87 | 267 | 117 | 115 | 44 | 104 | 41 | 86 | 107 | 181 | 30 | 45 | 12 | 21 | - | - | - | - |
| Random | 0.29 | 0.41 | 0.11 | 0.03 | 0.46 | 0.24 | 0.21 | 0.52 | 0.27 | 0.26 | 0.11 | 0.24 | 0.1 | 0.2 | 0.25 | 0.39 | 0.08 | 0.11 | 0.03 | 0.05 | 0.318 | 0.134 | 0.19 | 1 |
| Scene [he2016deep] | 0.43 | 0.55 | 0.74 | 0 | 0.49 | 0.38 | 0.63 | 0.55 | 0.51 | 0.28 | 0.24 | 0.42 | 0.3 | 0.28 | 0.41 | 0.51 | 0.22 | 0.19 | 0.11 | 0.33 | 0.434 | 0.489 | 0.437 | 0.48 |
| Audio [Abu-El-Haija2016] | 0.47 | 0.51 | 0.40 | 0.10 | 0.61 | 0.38 | 0.58 | 0.55 | 0.51 | 0.37 | 0.11 | 0.34 | 0.39 | 0.30 | 0.35 | 0.55 | 0.16 | 0.15 | 0.13 | 0.12 | 0.454 | 0.449 | 0.400 | 0.537 |
| Motion [carreira2017quo_kinetics] | 0.5 | 0.59 | 0.74 | 0 | 0.62 | 0.33 | 0.63 | 0.56 | 0.55 | 0.36 | 0.2 | 0.38 | 0.45 | 0.24 | 0.37 | 0.57 | 0.23 | 0.14 | 0.10 | 0.13 | 0.463 | 0.487 | 0.448 | 0.494 |
| Image [hu2018squeeze] | 0.48 | 0.63 | 0.79 | 0.12 | 0.65 | 0.41 | 0.60 | 0.59 | 0.55 | 0.42 | 0.25 | 0.47 | 0.42 | 0.29 | 0.50 | 0.54 | 0.34 | 0.19 | 0.12 | 0.31 | 0.516 | 0.554 | 0.493 | 0.572 |
| Image + Audio | 0.52 | 0.63 | 0.78 | 0.15 | 0.65 | 0.42 | 0.68 | 0.6 | 0.63 | 0.46 | 0.25 | 0.50 | 0.51 | 0.34 | 0.49 | 0.59 | 0.38 | 0.28 | 0.12 | 0.42 | 0.544 | 0.558 | 0.476 | 0.65 |
| Image + Motion | 0.59 | 0.64 | 0.78 | 0 | 0.59 | 0.39 | 0.66 | 0.6 | 0.6 | 0.5 | 0.29 | 0.54 | 0.53 | 0.25 | 0.52 | 0.57 | 0.4 | 0.2 | 0.24 | 0.12 | 0.535 | 0.553 | 0.511 | 0.583 |
| Image + Scene | 0.52 | 0.61 | 0.80 | 0.12 | 0.61 | 0.37 | 0.65 | 0.62 | 0.58 | 0.49 | 0.15 | 0.51 | 0.49 | 0.37 | 0.48 | 0.56 | 0.43 | 0.26 | 0.12 | 0.46 | 0.531 | 0.539 | 0.490 | 0.600 |
| Naive Concat | 0.56 | 0.61 | 0.64 | 0.09 | 0.64 | 0.35 | 0.69 | 0.60 | 0.58 | 0.39 | 0.19 | 0.49 | 0.45 | 0.21 | 0.48 | 0.6 | 0.39 | 0.28 | 0.27 | 0.41 | 0.525 | 0.497 | 0.522 | 0.551 |
| MMX-Trailer-20 | 0.62 | 0.69 | 0.71 | 0.11 | 0.71 | 0.53 | 0.73 | 0.62 | 0.64 | 0.51 | 0.34 | 0.56 | 0.60 | 0.45 | 0.50 | 0.64 | 0.30 | 0.11 | 0.13 | 0.55 | 0.597 | 0.583 | 0.554 | 0.697 |

Table 2: Coarse genre classification on the MMX-Trailer-20 dataset across differing expert features and combination methods. The final four columns report the global metrics.

Evaluation Metrics: We use the standard retrieval metrics as proposed in prior work [dong2016word2visualvec, Miech2017, mithun2018learning]. Given the variance in the frequency of occurrence of the genre labels in the dataset, we employ the following metrics designed to cope with unbalanced data: the area under the precision-recall curve under micro averaging, macro averaging, and weighted averaging. Each measure emphasises different aspects of the method's performance. The macro-averaged measure averages the areas of all labels, which causes less-frequent classes to have more influence on the results; this ensures that we are performing well across all categories, even for those that have fewer training samples or are more difficult to predict. The micro-averaged measure uses all labels globally, which makes high-frequency classes have greater influence on the results, ensuring that we obtain overall good results across all samples in the dataset. Finally, the weighted measure is calculated by averaging the area under the precision-recall curve per genre, weighting instances according to the class frequencies to give an averaged score. We also show weighted Precision, weighted Recall and weighted F1-Score. For all metrics, higher is better.
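
The difference between the three averaging schemes can be made concrete with a small sketch. It estimates the area under the precision-recall curve by average precision, which is one common estimator and an assumption here, since the text does not say how the area is computed; the function names and toy scores are also hypothetical.

```python
from typing import Dict, List


def average_precision(scores: List[float], labels: List[int]) -> float:
    """Average precision: mean of precision at each positive in the ranking."""
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            total += hits / rank
    return total / n_pos


def summarise(scores: List[List[float]], labels: List[List[int]]) -> Dict[str, float]:
    """Micro, macro and weighted averages for [samples][genres] inputs."""
    n_genres = len(scores[0])
    per_genre, support = [], []
    for g in range(n_genres):
        column_labels = [row[g] for row in labels]
        per_genre.append(average_precision([row[g] for row in scores], column_labels))
        support.append(sum(column_labels))
    # Macro: every genre counts equally. Weighted: genres count by support.
    macro = sum(per_genre) / n_genres
    weighted = sum(ap * s for ap, s in zip(per_genre, support)) / sum(support)
    # Micro: pool every (sample, genre) decision into one ranking.
    micro = average_precision(
        [s for row in scores for s in row], [y for row in labels for y in row]
    )
    return {"micro": micro, "macro": macro, "weighted": weighted}


# Toy predictions for four trailers and two genres.
toy_scores = [[0.9, 0.2], [0.8, 0.6], [0.1, 0.7], [0.3, 0.4]]
toy_labels = [[1, 0], [1, 0], [0, 1], [1, 0]]
metrics = summarise(toy_scores, toy_labels)
```

In this toy case each genre is ranked perfectly on its own, so the macro and weighted scores are 1.0, while the micro score is lower because pooling all decisions exposes scores that are not comparable across genres.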

4.3 Coarse Grained Genre Classification Results

We analyse the performance of our approach both quantitatively and qualitatively for both classification and self-supervised retrieval on the MMX-Trailer-20 trailer dataset. Tbl. 2 shows the quantitative performance of the coarse genre prediction model, both per genre and on the global metrics. The table also shows random performance, for the per-genre scores, where it varies according to the frequency of the genre in the dataset, and for the global metrics. It also explores the influence of each of the individual experts on the coarse genre classification task. Individual experts are passed directly through to the first MLP, while pairs are collaboratively gated as outlined in Sec. 3.1. From these results, we can see that the image expert is most valuable for genre classification and becomes more effective when combined with motion. Using collaborative gating yields a 10% increase over basic fusion through concatenation. Audio and Scene are the weakest experts for the classification task, which may be due to features that are not genre specific, such as dialogue and external environments. We find all Visual experts perform best on Animation, most likely due to its style being distinct from the other trailers, while the Audio expert performs better on Comedy and Sport. To identify the importance of the collaborative gating units, we compute a naive concatenation of the feature embeddings from the experts passed through an MLP layer (Naive Concat). This is shown to have a 10 point reduction compared to using the gating to aggregate the features, illustrating the importance of the learnt collaborative gating framework.

To attempt a comparison to other approaches, Tbl. 3 shows the best performance of other approaches on different datasets. Our model, MMX-Trailer-20, uses up to 6 genre labels per sample from a total of 20 genres, double that of most other approaches; this affects the random baseline, which is nearly half that of the 9 genre datasets. To contextualise our method with others, we compare previous approaches including video low level features (VLLF) [rasheed2005use], audio-visual features (AV) [huang2012movie, cascante2019moviescope], audio-visual features with convolutions over time (CTT-MMC) [wehrmann2017movie], and an LSTM model that uses visual feature data in a standard sequence analysis approach as implemented for comparison in [wehrmann2017movie]. From the results in Tbl. 3, we show that our model performs better than low-level features as well as the LSTM model. We do not improve on the performance of other audiovisual approaches that fine-tune pre-trained networks in an end-to-end manner [wehrmann2017movie, cascante2019moviescope], which use a far smaller subset of genres and labels in their older datasets.

Method                               No. genres  No. labels
Random 9 Class                       9           3           0.206  0.204  0.294
Random 20 Class                      20          6           0.134  0.130  0.208
VLLF [rasheed2005use]                9           3           0.278  0.476  0.386
AV [huang2012movie]                  9           3           0.455  0.599  0.567
LSTM [wehrmann2017movie]             9           3           0.520  0.640  0.590
CTT-MMC [wehrmann2017movie]          9           3           0.646  0.742  0.724
Moviescope [cascante2019moviescope]  13          3           0.703  0.615  -
Proposed MMX-Trailer-20              20          6           0.456  0.589  0.583

Table 3: Comparison of our proposed approach with the other methods for genre classification.

4.4 Fine-Grained Genre Exploration

While the coarse genre classification is interesting, discrete labels are in general limited in providing a full understanding of complex trailers. It is challenging to quantify this fully. However, we evaluate the effectiveness of the self-supervised fine-grained genre learning by comparing the cosine similarity between trailer embedding vectors before and after they are processed by the fine-grained self-supervised network. This is visualised in a t-SNE plot in Fig. 1, where the colours indicate the primary genre. Fig. 1(a) shows the learnt embedding for the coarse genre classification, where tight genre clusters are formed. Fig. 1(b) shows the embedding after the self-supervised training of the model, where the clusters have broken up into an overlapping distribution as genres are separated depending on the multi-modal content present in the trailer. We have identified three trailers (Cleopatra, Braveheart, and Darkest Hour) which share the triple genre classification of Drama, Biography, History (identified by the three shapes). These are correctly labelled by the coarse genre encoder, have a high cosine similarity to one another and, in Fig. 1(a), are spatially close in the coarse genre t-SNE plot. In Fig. 1(b), after self-supervised training, the trailer embeddings have higher cosine similarity to trailers from other genres. For example, we find that Cleopatra is drawn closer to Adventure films featuring deserts and orchestral scores (Lawrence of Arabia is one example). Braveheart shares a high cosine similarity with medieval and mythological trailers featuring large-scale battles, while Darkest Hour moves towards a cluster featuring historical thrillers such as The Imitation Game. This effect is quantified in Fig. 6, which shows the silhouette score [SilhouetteScore] of the embedding space during the fine-grained training phase. The decreasing score shows that the tight but restrictive genre clusters of the coarse model are being broken up and that genre overlap increases as the training continues.
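As a rough illustration of the measurement behind Fig. 6, the sketch below computes a mean silhouette coefficient over a handful of toy embeddings, using the primary genre as the cluster label. The cosine distance, the function names and the toy vectors are all assumptions made for this example; the text does not state which distance metric was used.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def mean_distance(point, others):
    return sum(cosine_distance(point, o) for o in others) / len(others)


def silhouette_score(embeddings, labels):
    """Mean silhouette coefficient; labels play the role of primary genres."""
    clusters = {}
    for vector, label in zip(embeddings, labels):
        clusters.setdefault(label, []).append(vector)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")

    scores = []
    for i, (vector, label) in enumerate(zip(embeddings, labels)):
        # Same-cluster neighbours, excluding the point itself.
        own = [v for j, v in enumerate(embeddings) if labels[j] == label and j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = mean_distance(vector, own)
        b = min(mean_distance(vector, members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D embeddings: tight genre clusters versus overlapping ones.
labels = [0, 0, 1, 1]
tight = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
overlapping = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [0.6, 0.4]]

tight_score = silhouette_score(tight, labels)
overlap_score = silhouette_score(overlapping, labels)
print(f"tight: {tight_score:.3f}, overlapping: {overlap_score:.3f}")
```

A score near 1 corresponds to well-separated genre clusters, so a falling score over fine-tuning epochs is consistent with clusters overlapping more.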

Figure 5: Representative retrieval results for 5 queries on the Self-Supervised feature space. We find that the model is able to retrieve trailers with greater contextual awareness than is possible from just classification or Unsupervised training.
Figure 6: Silhouette score [SilhouetteScore] of the coarse encoder output and then the following 50 epochs of fine-tuning to develop the fine grained model. The decreasing score shows that the tight discrete genre clusters are separating and overlapping more as we fine-tune the network.

We can also show illustrative retrieval results. In Fig. 5 we provide five further retrieval examples. The first example, Trolls (2016), not only retrieves its sequel, Trolls World Tour (2020), but also retrieves animated family movies featuring monsters. Retrieval four, Mega Shark Vs Giant Octopus (2009), has the highest cosine similarity to other sea-monster and environmental disaster movies such as Bermuda Tentacles (2014) and The Meg (2018). In retrieval five, Seethamma Vakitlo Sirimalle Chettu (2013), we discover that the model clusters other Telugu-language films, demonstrating the model’s ability to identify a cultural context within genre clusters. However, we also find Bridget Jones’s Diary (2001) within this cluster, suggesting that the cluster has not been completely isolated from other romantic comedies. These results demonstrate greater depth and nuance when compared to retrieval using the coarse genre classifier. It is important to note that the returned results have not been determined using low-level pixel information. Instead, they uncover an interesting semantic and inter-textual relationship between the style, sound, and content of different movie genres and production techniques.
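The retrieval described above amounts to ranking trailers by cosine similarity to a query embedding. The following is a minimal sketch under that reading; the function names, the placeholder trailer keys and the 3-dimensional vectors are hypothetical and stand in for the real trailer embeddings.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_key, embeddings, k=3):
    """Return the k keys most similar to the query, best match first."""
    query = embeddings[query_key]
    scored = [
        (cosine_similarity(query, vector), key)
        for key, vector in embeddings.items()
        if key != query_key  # never return the query itself
    ]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Placeholder embeddings; real ones would come from the trained network.
toy_embeddings = {
    "trailer_a": [1.0, 0.0, 0.0],
    "trailer_b": [0.9, 0.1, 0.0],
    "trailer_c": [0.5, 0.5, 0.0],
    "trailer_d": [0.0, 0.0, 1.0],
}

top_two = retrieve("trailer_a", toy_embeddings, k=2)
print(top_two)
```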

4.5 Augmentation of Genre Labels

It is also possible to use the self-supervised network to augment and improve the overall labelling of the original movie trailers, as shown in Tbl. 4. To test this, we examined the genre labels supplied by IMDb to find mislabelled examples in the dataset. We then asked the network to produce labels based on a sigmoid threshold of 0.30. The model was able to match and extend the IMDb labels, and also offered additional labels that are logical given the content of the trailer.
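The thresholding step can be sketched as follows. Only the 0.30 threshold comes from the text; the four-genre subset, the logits and the helper names are illustrative assumptions.

```python
import math

# Illustrative subset only; the dataset itself uses 20 genres.
GENRES = ["Action", "Adventure", "Comedy", "Family"]


def sigmoid(z):
    # Split by sign so that exp() never overflows.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_genres(logits, genres, threshold=0.30):
    """Keep every genre whose sigmoid activation exceeds the threshold."""
    return [g for g, z in zip(genres, logits) if sigmoid(z) > threshold]


def additional_labels(existing, predicted):
    """Predicted genres that are not already in the existing label set."""
    return [g for g in predicted if g not in existing]


toy_logits = [2.0, -0.5, -2.0, 0.0]  # made-up network outputs
predicted = predict_genres(toy_logits, GENRES)
extra = additional_labels(["Action"], predicted)
print(predicted, extra)
```

A threshold below 0.5 admits labels the network is less certain about, which suits label augmentation where the goal is to propose candidates rather than to make a single hard decision.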

Movie                                    IMDb labelled Genres        Predicted Genres
101 Dalmatians II                        Action, Family              Adventure, Comedy, Family, Fantasy, Animation
300: Rise of an Empire                   Action, Drama               Action, Adventure, Drama, Fantasy
Alien: Covenant                          Horror, Sci-Fi, Thriller    Action, Adventure, Horror, Sci-Fi
Company Of Heroes                        Drama, War                  War, History, Drama, Action
Independence Day: Resurgence             Action, Sci-Fi              Action, Sci-Fi, Adventure, Thriller
Laws of Attraction                       Comedy                      Comedy, Crime, Drama
Leprechaun Returns                       Comedy, Fantasy, Horror     Adventure, Comedy, Fantasy, Horror
Santa Paws 2                             Family                      Family, Comedy, Adventure, Fantasy
The Hobbit: The Battle of the 5 Armies   Action, Adventure           Adventure, Fantasy, Action, Sci-Fi
The Land Before Time VIII                Family, Adventure, Fantasy  Adventure, Fantasy, Family, Animation

Table 4: Example of the improved genre labelling of movies. IMDb labelled Genres is the label sourced from IMDb and Predicted Genres is the result of the proposed model. Blue indicates additional predicted labels.

5 Effect of sequence length

In [wehrmann_2016_deep] it is shown that capturing temporal information using both 3D convolutions and LSTMs can help with the genre classification task. To see if temporal information could be retained through concatenation of scenes and sequences, we experimented with a number of sequence length variations. Fig. 7 shows how we extracted clips and constructed sequences. First, we used the scene detection method outlined in the paper to extract individual clips, before performing feature extraction and pooling to create equal 1 x 768 embeddings for every mode. Following collaborative gating, the resulting attention vectors are concatenated into a sequence embedding. The concatenated sequences are then passed to the MLP before being concatenated with all other sequences to form a feature embedding for the whole trailer. In the case of self-supervised learning, we select random sequences from the same trailer for comparison.
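The construction of sequences from clips can be sketched as below. The sequence length of 9 and the per-clip dimension of 128 follow the text; dropping an incomplete trailing group of clips, the helper names and the toy clip vectors are assumptions made for this example.

```python
import random


def build_sequences(clip_vectors, sequence_length=9):
    """Concatenate consecutive clips into flat sequence vectors.

    Assumption: a trailing group with fewer than sequence_length clips
    is dropped rather than padded.
    """
    sequences = []
    last_start = len(clip_vectors) - sequence_length
    for start in range(0, last_start + 1, sequence_length):
        window = clip_vectors[start:start + sequence_length]
        sequences.append([value for clip in window for value in clip])
    return sequences


def sample_pair(sequences, rng):
    """Pick two different sequences from the same trailer."""
    first, second = rng.sample(range(len(sequences)), 2)
    return sequences[first], sequences[second]


# Toy trailer: 20 clips, each a 128-dimensional vector filled with its index.
clips = [[float(i)] * 128 for i in range(20)]
sequences = build_sequences(clips)
pair = sample_pair(sequences, random.Random(0))
print(len(sequences), len(sequences[0]))
```

With 20 clips and a sequence length of 9, two full sequences of 9 x 128 = 1152 values are produced and the last two clips are unused.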

Figure 7: Splitting of trailers into clips and sequences. Here a sequence length of five is shown but a length of nine was used in the implementation.
Figure 8: Representative results for multi-label classification on single scenes. Green represents results where the model predicts the correct number and correct labels for the scene. Black indicates where the model suggests alternative genres for the scene. We found that the model was able to make adequate guesses at the individual scene genre when not presented in context with the whole trailer. For example scene 44 from Point Break (bottom left) makes the prediction ‘Adventure, Action’ which would be a good prediction given the image, camera motion, and audio of the scene.
Figure 9: Precision-recall curves for each expert over all labels.
Figure 10: Additional retrieval results obtained from the bottleneck embedding layer after training for coarse genre classification, and after fine-tuning with the fine-grained self-supervised network.
Sequence Length
1 0.456 0.451 0.475 0.484
5 0.493 0.518 0.428 0.625
9 0.564 0.583 0.554 0.611
20 0.495 0.503 0.493 0.576
Table 5: The effect of sequence length on classification accuracy across a number of metrics.

In Tbl. 5 we show the effect of sequence length on classification accuracy across a number of metrics. We find that longer sequences assist the model in making more accurate genre predictions, suggesting that temporal information is captured through the concatenation of scenes. After a sequence length of 10, however, we notice that accuracy begins to decrease. We select a sequence length of 9 scenes for the model to ensure we can use as much of the dataset as possible without compromising on performance.

In Fig. 8 we can see that our model makes good predictions on individual scenes and offers reasonable guesses considering the content of the scene in isolation from the whole trailer. Because a single scene can plausibly suggest a genre other than that of the trailer as a whole, this helps explain why the model performs poorly with shorter sequences. We also notice how genre predictions change over the duration of a trailer on a scene-by-scene basis. This is even more prevalent in modern movies, where genre fluidity is more common or where other genres are referenced for narrative and stylistic effect.

6 Effect of individual experts

In Fig. 9 we show the precision-recall curves for each label and modal expert. As might be expected, ‘Animation’ is the best performing label on visual experts, with ‘Comedy’ performing well over both audio and visual experts. ‘Documentary’ is best identified by the scene and audio experts, perhaps because of the additional dialogue and use of establishing shots in ‘Documentary’ trailers. While we expected the audio expert to perform best on the ‘Music’ label, we find that image and scene experts perform just as well. This may be due to the image expert identifying instruments and the scene expert associating music with auditoriums and stadiums. The influence of different experts over the labels demonstrates the advantage of using a collaboratively gated, multi-modal approach.
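For readers less familiar with precision-recall curves, the sketch below traces the curve for a single label by sweeping a decision threshold over the scores. The scores and targets are made up, and the function name is hypothetical; it is not the evaluation code used to produce Fig. 9.

```python
def precision_recall_points(scores, targets):
    """Return (threshold, precision, recall) for each distinct score,
    from the strictest threshold to the most permissive."""
    total_positives = sum(targets)
    if total_positives == 0:
        raise ValueError("label has no positive examples")

    points = []
    for threshold in sorted(set(scores), reverse=True):
        predicted = [s >= threshold for s in scores]
        true_pos = sum(1 for p, t in zip(predicted, targets) if p and t)
        false_pos = sum(1 for p, t in zip(predicted, targets) if p and not t)
        precision = true_pos / (true_pos + false_pos)
        recall = true_pos / total_positives
        points.append((threshold, precision, recall))
    return points


# Made-up sigmoid scores for one genre label and the matching ground truth.
toy_scores = [0.9, 0.8, 0.4, 0.2]
toy_targets = [1, 0, 1, 0]

curve = precision_recall_points(toy_scores, toy_targets)
for threshold, precision, recall in curve:
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

Repeating this per label and per expert would give one curve for each combination, which is the kind of comparison the figure summarises.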

7 MLP Dimensions

To assist in implementation we provide further description of the multi-layer perceptrons used throughout the network.

  • After scene detection and feature extraction, the collaborative gating module returns an L2 Normalised feature vector of dimension 1 x 128 which encompasses the scaled multi-modal information for that individual clip.

  • As we are working with 9 clips per sequence in the implementation, we concatenate the 9 x 128 vectors to create a 1 x 1152 sequence vector. This is then passed through a two-layer MLP with ReLU activations, resulting in a sequence vector of size 1 x 256.

  • Each sequence is then concatenated to produce a 1 x 10240 trailer vector. The size is a result of 40 sequences of size 1 x 256 being concatenated. The bottleneck layer then processes the 1 x 10240 embedding via another MLP of two layers with ReLU activations, generating a feature embedding of shape 1 x 2048.

  • For genre classification, the MLP is defined by 2 layers, with ReLU activation on the first layer of size 2048 x 1024. The final layer produces a logits vector of size 1 x 21. To assess the accuracy of the model during training, we pass the logits to a sigmoid function and return each class activated over a threshold of 0.3.

  • For fine-tuning, the 1 x 2048 embedding vector is instead processed via two MLPs, resulting in an embedding vector of shape 1 x 128.

  • Following training and fine-tuning, embeddings of shape 1 x 2048 are taken from the bottleneck layer and used as trailer representations to compare cosine similarity.
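The dimensions listed above can be checked with a short shape trace. This sketch only does the bookkeeping; it does not build the network, and the constant and stage names are chosen here for illustration.

```python
# Dimensions quoted in the list above.
CLIP_DIM = 128
CLIPS_PER_SEQUENCE = 9
SEQUENCE_DIM = 256
SEQUENCES_PER_TRAILER = 40
BOTTLENECK_DIM = 2048
CLASSIFIER_HIDDEN_DIM = 1024
NUM_LOGITS = 21
FINE_TUNE_DIM = 128


def trace_shapes():
    """Return (stage, shape) pairs in the order the data flows.

    The last three entries are two branches from the bottleneck:
    the genre classifier and the fine-tuning head.
    """
    return [
        ("gated clip vector", (1, CLIP_DIM)),
        ("concatenated sequence", (1, CLIPS_PER_SEQUENCE * CLIP_DIM)),
        ("sequence vector after MLP", (1, SEQUENCE_DIM)),
        ("concatenated trailer", (1, SEQUENCES_PER_TRAILER * SEQUENCE_DIM)),
        ("bottleneck embedding", (1, BOTTLENECK_DIM)),
        ("classifier hidden layer", (1, CLASSIFIER_HIDDEN_DIM)),
        ("genre logits", (1, NUM_LOGITS)),
        ("fine-tuning embedding", (1, FINE_TUNE_DIM)),
    ]


shapes = dict(trace_shapes())
for stage, (rows, cols) in trace_shapes():
    print(f"{stage:28s} {rows} x {cols}")
```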

8 Conclusion

While previous works have shown the effectiveness of convolutional neural networks and deep learning for genre classification, these methods do not address the unique inter-textual differences that exist within these discrete labels. Using a collaboratively gated multi-modal network, we show that genre labels can be subdivided and extended by discovering semantic differences between the videos within these categories. By first training a genre classifier, rather than training a self-supervised model from scratch, we encourage the video embeddings to retain genre-specific information. Continuing to train without labels introduces contextual awareness of the multi-modal content, and augments the embedding to align more closely with similar movies while retaining genre-specific information. We hope that the ability to cluster videos in this way will assist in the retrieval and classification of film styles, and benefit researchers working in film theory, archiving, and video recommendation.