MovieCuts: A New Dataset and Benchmark for Cut Type Recognition

09/12/2021 ∙ by Alejandro Pardo, et al. ∙ King Abdullah University of Science and Technology ∙ Adobe

Understanding movies and their structural patterns is a crucial task in decoding the craft of video editing. While previous works have developed tools for general analysis, such as detecting characters or recognizing cinematography properties at the shot level, less effort has been devoted to understanding the most basic video edit, the Cut. This paper introduces the cut type recognition task, which requires modeling multi-modal information. To ignite research in this new task, we construct a large-scale dataset called MovieCuts, which contains more than 170K video clips labeled among ten cut types. We benchmark a series of audio-visual approaches, including some that deal with the problem's multi-modal and multi-label nature. Our best model achieves 45.7% mAP, which suggests that the task is challenging and that attaining highly accurate cut type recognition is an open research problem.




1 Introduction

Figure 1: Cut Type Recognition Task. Our goal is to predict the types of cuts in a video clip. Two shots, A and B, with their time-aligned audio, compose a cut. Cuts are designed to preserve audio-visual continuity [39] across time, space, or story, and can be classified accordingly.


Professionally edited movies follow what is known as film grammar [1], a set of conventions that rule the structure of a film. In this grammar, a scene is a section of a motion picture in a single location and continuous time, made up of a series of shots [25]. A shot is a set of contiguous frames from an individual camera at a given angle [25]. Finally, a shot transition, or Cut, is the most basic video editing device; it makes the transition from one shot to another [27, 46]. Cuts create changes of perspective, highlight emotions, and help advance stories [27, 46]. Consequently, and due to advances in digital editing, their usage in movies has steadily increased over time [11]. Cuts in professionally edited movies are not random but rather have a language, structure, and taxonomy [1]. Each cut in a movie has a purpose and a specific meaning. Thus, to understand movie content, one has to understand the cuts. Tsivian et al. introduced a platform called Cinemetrics [11]

to analyze movies by analyzing their cut frequency. However, with movies containing dozens of cuts per scene, analyzing and understanding those cuts is a difficult task. Despite their importance in editing, widespread adoption, and meaning in movies, research to understand creative cuts remains unexplored in the computer vision community. We argue that automatically recognizing and understanding cut types would be a step towards decoding video editing's core principles. Furthermore, recognizing cut types can enable new applications in the video editing industry, such as movie analysis for education, video re-editing, virtual cinematography, and machine-assisted trailer generation, among others.

Figure 1 illustrates the cut type recognition task introduced in this work. A Cut is composed of two adjacent shots and the transition between them. Cuts are made not only of visual frames but also of their time-aligned sound stream. In many situations, the audio drives the cut and shapes its meaning. Our goal is then to recognize the cut type by analyzing the clip's audio-visual information across shots. Multiple research challenges emerge from this new multi-shot video understanding task. First, there is a need for a high-level understanding of visual and audio relationships over time to identify the intended cut type. For instance, to recognize the J Cut in example 1, one needs a detailed audio-visual inspection to associate the sound in Shot B with the actor's voice in Shot A. In contrast, the Reaction Cut in Figure 1 mainly requires a visual understanding of facial expressions. Furthermore, a cut may carry multiple labels that require orthogonal analyses to make correct predictions. We argue these challenges can promote the development of new architectures that address the multi-modal and multi-shot nature of the problem.

Understanding the audio-visual properties of movies has a long-standing history of interest [28, 14, 36, 47]. The community has developed methods to recognize characters and speakers [31, 13, 8], events and actions [29, 14, 18], story-lines [2, 47, 21], and shot-level cinematography properties such as shot scale and camera motion [34, 9], and to mine shot-sequencing patterns [52, 51, 46]. While these approaches have set an initial framework for understanding editing in movies, there is still a lack of automated tools that understand the most basic and most used editing technique, the cut.

This work aims to study and bootstrap research in cut type recognition. To do so, we introduce MovieCuts, a new large-scale dataset with manually curated multi-label cut type annotations. Our dataset contains more than 170K video clips (each containing a cut) labeled among ten different categories. We surveyed professional editors to define our taxonomy and hired qualified annotators to label the cut type categories. MovieCuts offers the opportunity to benchmark core research problems such as multi-modal analysis, learning from long-tailed distributions, and multi-label classification. We benchmark a variety of audio-visual baselines. While we observe improvements by leveraging recent techniques for audio-visual blending [49] and long-tailed multi-label learning [53], there is ample room for improvement, and the task remains an open research problem.

Contributions. Our quest is to bootstrap research in cut type understanding. Our contributions are threefold:

(1) We introduce the cut type recognition task. To the best of our knowledge, our work is the first to address and formalize this task from a machine learning perspective.

(2) We collect a large-scale dataset containing qualified human annotations that verify the presence of different cut types. We extensively analyze the dataset to highlight its properties and the challenges it poses. We call this dataset MovieCuts (Section 3).
(3) We implement multiple audio-visual baselines to establish a benchmark in cut type recognition (Section 4).

2 Related Work

Edited Content in Video Understanding. Edited video content, such as movies, has been a rich source of data for general video understanding. Such video sources contain various human actions, objects, and situations occurring in people's daily life. In the early stages of action recognition, Bojanowski et al. proposed a dataset called Hollywood Human Actions (HOHA) [28], which contains short clips from 32 Hollywood movies with annotations for human-action recognition. Another group of works used a limited number of films to train and test methods for character recognition [37], human action localization [12], event localization [29], and spatio-temporal action and character localization [4]. With the development of deep-learning techniques and the need for large-scale datasets to train deep models, Gu et al. proposed the AVA dataset [14], a large-scale dataset with spatio-temporal annotations of actors and actions, whose primary data sources are movies and TV shows. Furthermore, other works have focused on action and event recognition across shots [29, 18]. Finally, Pavlakos et al. leverage information across shots from TV shows to perform human mesh reconstruction [32]. Instead of leveraging movie data to learn representations for traditional tasks, we propose a new task to automatically analyze movie cut types.

Figure 2: MovieCuts Dataset. MovieCuts contains video clips labeled among 10 different cut types. Each sample in the dataset is composed of two shots (with a cut) and their accompanying audio. Our cuts are grouped into two major categories, visual (top) and audio-visual (bottom) driven. Each cut’s definition can be found in the supplementary material.

Stories, Plots, and Cinematography. Movies and TV shows are rich in complexity and content, which makes their analysis and understanding a challenging task. Movies are a natural multi-modal source of data, with audio, video, and often even transcripts available. Several recent works in the literature focus on understanding movie content, addressing tasks such as movie trailer creation [24, 17, 56, 38] and TV-show summarization [6, 7]. Vicol et al. proposed MovieGraphs [47], a dataset that uses movies to analyze human-centric situations. Rohrbach et al. presented the Movie Description dataset [36], which contains audio narratives and movie scripts aligned to the movies' full length. Using this dataset, the Large Scale Movie Description Challenge (LSMDC) has hosted competitions for a variety of tasks, including Movie Fill-In-The-Blank [30] and movie Q&A [42], among others. Like LSMDC, MovieNet [21] and Condensed Movies [2] are large projects that contain several tasks, data, and annotations related to movie understanding. MovieNet includes works related to person re-identification [23, 54, 19, 20], movie scene temporal segmentation [35], and trailer and synopsis analysis [55, 22]. All these works have shown that movies carry rich information about human actions, along with their specific challenges. However, only a few have focused on movies' artistic aspects, such as shot scales [34, 9, 3], shot taxonomy and classification [48], and editing structure and cinematography [50, 51, 52]. These studies set the foundations for analyzing movies' editing properties but missed one of the most used techniques, the cut. Understanding cuts is crucial for decoding the grammar of the film language, and our work bootstraps research towards that goal.

3 The MovieCuts Dataset

3.1 Building MovieCuts

(a) Label Distribution
(b) Multi-label distribution and co-occurrences
(c) Cut duration distribution
(d) Cuts per scene genre
(e) Cuts per scene by production year
(f) Cuts per scene distribution
Figure 3: MovieCuts statistics. Figure 2(a) shows the number of instances per cut type. Labels follow a long-tail distribution. Figure 2(b) indicates that a large number of instances contain more than a single cut type; moreover, certain pairs of cut types co-occur more often. Figure 2(c) plots the distribution of lengths (in seconds) of all the dataset instances. Figure 2(d) summarizes the production properties of the movie scenes and cuts used in our study. Figure 2(e) shows the distribution based on year of production. Finally, Figure 2(f) shows the distribution of the number of cuts per scene.

Taxonomy of Cuts. Our goal is to find a set of cut categories often used in movie editing. Although there exists literature on the grammar of the film language [1, 40] and on the taxonomy of shot types [48, 9, 3], there is no gold-standard categorization of cuts. To cope with this lack of taxonomy, we surveyed a group of editors who helped us formalize a set of cut types. We gathered an initial taxonomy (17 cut types) from film-making courses [45], textbooks [5], and blogs [44]. We then hired ten different editors to validate the taxonomy. All the editors studied film-making, two have been nominated for Emmys, and most have over 10 years of experience. Some of the original categories were duplicates, and some (like Jump Cuts) were challenging to mine from movies. Our final taxonomy includes ten categories. Figure 2 illustrates each cut type along with its visual and audio signals.
We give a brief description of each cut type in MovieCuts:
1. Cutting on Action: cutting from one shot to another while the subject is still in motion.
2. Cross Cut: cutting back and forth between locations.
3. Emphasis Cut: cutting from wide to close within the same shot, or the other way around.
4. Cut Away: cutting to an insert shot of something and then back.
5. Reaction Cut: a cut to a subject's reaction (facial expression or single word) to other actors' comments/actions, or a cut after the reaction.
6. Speaker Change: a cut that changes the shot to the current speaker.
7. J Cut: the audio from the next shot begins before you see it; you hear what is going on before you actually see it.
8. L Cut: the audio from the current shot carries over to the next shot.
9. Smash Cut: an abrupt cut from one shot to another for aesthetic, narrative, or emotional purpose.
10. Match Cut: a cut from one shot to another that matches a concept, an action, or the composition.

Video Collection and Processing. We need professionally edited videos containing diverse types of cuts, and movies are a perfect source for such data. As pointed out in [2], there are online video channels (e.g., the MovieClips YouTube channel, the source of our videos) that distribute movie scenes to the public, which facilitates access to movie data for research. We downloaded movie scenes sourced from a wide range of movies. However, these movie scenes come untrimmed, and further processing is required to obtain cuts from them. We automatically detect all the cuts in the dataset with a highly accurate shot detector [15] (high precision and recall), which yields a large pool of candidate cuts for annotation.
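The shot detector we use is a dedicated model [15], but the core idea of shot-boundary detection can be sketched with a simple color-histogram difference between consecutive frames. The `bins` and `threshold` values below are hypothetical tuning knobs, not the settings of the detector used in the paper:

```python
def color_histogram(frame, bins=8):
    """Coarse per-channel histogram of an RGB frame (a list of (r, g, b) pixels)."""
    hist = [0] * (bins * 3)
    for r, g, b in frame:
        hist[r * bins // 256] += 1
        hist[bins + g * bins // 256] += 1
        hist[2 * bins + b * bins // 256] += 1
    total = len(frame)
    return [h / total for h in hist]

def detect_cuts(frames, threshold=0.5):
    """Flag a cut between consecutive frames whose histograms differ sharply."""
    cuts = []
    prev = color_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = color_histogram(frames[i])
        dist = sum(abs(a - b) for a, b in zip(prev, cur))  # L1 distance
        if dist > threshold:
            cuts.append(i)  # a cut occurs between frame i-1 and frame i
        prev = cur
    return cuts
```

Real detectors are far more robust (e.g., to fast motion and lighting changes), but this illustrates why hard cuts are comparatively easy to mine automatically.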

Human Annotations and Verification. Our quest at this stage is to collect expert annotations for the candidate cuts. To do so, we hired Hive AI to run our annotation campaign, chosen for their experience labeling data for the entertainment industry. Annotators did not necessarily have a film-making background, but they had to pass a qualification test to participate in the labeling process. At least three annotators reviewed each cut/label candidate pair, and only annotations with more than two votes were kept. Annotators also had the option to discard cuts when: (i) the shot boundary detector made an error, or (ii) the cut did not show any of the categories in our taxonomy. We also built a handbook in partnership with the professional editors that includes several examples per class and guidelines for addressing edge cases. After discarding such cuts, we were left with more than 170K cuts to form our dataset. To validate the quality of the annotations, we launched a second campaign to re-annotate two thousand cuts with a different pool of workers; we found that the ratio of annotation errors between the two campaigns was low. This inter-coder agreement is only computed for samples that passed our consensus filter; a small fraction of the initial pool of candidate cuts did not reach enough consensus and was discarded. Note that the inter-coder agreement does not account for missing labels. We thus invited five professional editors to label the same two thousand cuts to create a high-consensus ground truth. We found that the original annotations exhibited high precision and recall when contrasted with this gold standard.
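The majority-vote filtering described above can be sketched as follows; the function name and data layout are illustrative, not part of the actual annotation pipeline:

```python
from collections import Counter

def consensus_labels(annotations, min_votes=3):
    """Keep only the cut-type labels that receive at least `min_votes` votes.

    `annotations` is a list of per-annotator label sets for one candidate cut;
    with three annotators, "more than two votes" means unanimous agreement.
    """
    votes = Counter(label for labels in annotations for label in labels)
    return sorted(label for label, n in votes.items() if n >= min_votes)
```

A candidate cut whose result is empty would be discarded for lack of consensus, mirroring the filtering step above.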

Please refer to the supplementary material for further details about: (i) examples of each cut type; (ii) additional information about the annotation consensus.

3.2 MovieCuts Statistics

(a) Sound Attributes
(b) Subject Attributes
(c) Location Attributes
Figure 4: MovieCuts attributes. MovieCuts contains diverse sounds 3(a), subjects 3(b), and locations 3(c). Some sounds co-occur more often with particular cut types. For instance, Speech is the predominant sound for dialogue-related cut types such as Speaker Change or J Cut. Similarly, there exist correlations between cut types and the subjects depicted in the video clip.

Label distribution. Figure 2(a) shows the distribution of cut types in MovieCuts. The distribution is long-tailed (Zipf's law), which is natural and reflects editors' preferences for certain types of cuts. It is no surprise that Reaction Cut is the most abundant label, given that emotion and human reactions play a central role in storytelling. Beyond human emotion, dialogue and human actions are additional key components to advance video stories. This is reflected in the label distribution, where Speaker Change and Cutting on Action are the second and third most frequent categories in the dataset. While classes such as Smash Cut and Match Cut appear scarcely in the dataset, it is still important to recognize these types of cuts, which are arguably the most creative ones.

Multi-label distribution and co-occurrences. We plot in Figure 2(b) the distribution of labels per cut and the co-occurrence matrix. On the one hand, we observe that a significant number of cuts contain more than one label. On the other hand, we observe that certain pairs of classes co-occur more often, e.g., Reaction Cut / L Cut. The multi-label properties of the dataset suggest that video editors compose and combine cut types quite often.

Duration of shot pairs. We define the cut duration as the duration of the shot pair that forms the cut. Figure 2(c) shows the distribution of cut durations. The most typical cut duration is about 3.5 seconds. Moreover, we observe that cut lengths range from 2 seconds to more than 30 seconds. While some cut types, such as Match Cut, would not benefit from analyzing a large temporal context, most cuts would benefit from leveraging extended time spans.

Cut genres, year of production, and cuts per scene. Figures 2(d), 2(e), and 2(f) show statistics about the productions from which the cuts were sampled. First, we observe that the cuts were sampled across a diverse set of genres, with Comedy being the most frequent one. Second, we sourced cuts from both old and contemporary movie scenes. While many cuts come from the last decade, we scouted cuts from movie scenes as far back as the 1930s. Finally, we observe that the number of cuts per scene roughly follows a normal distribution with a mean of 15 cuts per scene. Interestingly, a few movie scenes have a single cut, while others contain more than 60 cuts. These statistics highlight the editing diversity from which we mined the cuts.

3.3 MovieCuts Attributes

Sound Attributes. We leverage an off-the-shelf audio classifier [10] to annotate the sound events in the dataset; Figure 3(a) summarizes the distribution of three super-groups of sound events: Speech, Music, and Other. Dialogue-related cuts such as Speaker Change and J Cut contain a large amount of speech. In contrast, visually-driven cuts such as Match Cut and Smash Cut contain a larger proportion of varied sounds and background music. These attributes suggest that while analyzing speech is crucial for recognizing cut types, it is also beneficial to model music and general sounds.

Subject Attributes. We build a zero-shot classifier using CLIP [33] to tag the subjects present in our dataset samples (Figure 3(b)). Interestingly, dialogue- and emotion-driven cuts (e.g., Reaction Cut) contain many Face tags, which can be interpreted as humans framed in medium-to-close-up shots. In contrast, Body is the most common attribute in the Cutting on Action class, which suggests editors often opt for framing humans in long shots when actions are happening.

Location Attributes. We reuse CLIP [33] to construct a zero-shot classifier of locations, which we apply to tag the dataset. Figure 3(c) summarizes the distribution of locations (Interior/Exterior only) per cut type. On the one hand, we observe that most cut types contain instances shot in Interior locations 60%-70% of the time. On the other hand, Match Cuts reverse this trend, with the majority (53%) of cuts shot in Exterior places. The obtained distribution suggests that stories in movies (as in real life) develop more often in indoor places.
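A zero-shot tagger of this kind reduces to a nearest-neighbor search in a joint embedding space. The sketch below assumes image and prompt embeddings have already been computed with a vision-language model such as CLIP; the vectors in the test are toy values, not real embeddings:

```python
import math

def zero_shot_tag(image_emb, text_embs, labels):
    """Return the label whose prompt embedding has the highest cosine
    similarity with the image embedding (embeddings assumed to come from
    a joint vision-language model such as CLIP)."""
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    sims = [cosine(image_emb, t) for t in text_embs]
    return labels[sims.index(max(sims))]
```

In practice one embeds prompts like "a photo taken indoors" / "a photo taken outdoors" and tags each clip frame with the closest prompt.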

4 Experiments

4.1 Audiovisual Baseline

Figure 5: Audio-visual pipeline. We use a siamese network with two streams, one for the audio and one for the visuals. We concatenate their features to form an audio-visual feature. We train both networks jointly using an audio loss L_a, a visual loss L_v, and an audio-visual loss L_av, weighted with w_a, w_v, and w_av, respectively.

Our base architecture is shown in Figure 5. It takes as input the audio signal and a set of frames, which are processed by a Siamese CNN to extract per-clip features. We combine these features and use a Multi-Layer Perceptron (MLP) to obtain the final predictions. We optimize a binary cross-entropy (BCE) loss per modality (Audio, Visual, and Audio-Visual) in a one-vs-all manner to deal with the problem's multi-label nature. Our loss is summarized as:

L = w_a * L_a + w_v * L_v + w_av * L_av,    (1)

where w_a, w_v, and w_av are the weights for the audio, visual, and audio-visual losses L_a, L_v, and L_av, respectively. Using this architecture, we propose several baselines:
Linear classifier. We extract features for each stream and train a linear classifier on them and on their concatenation.
Backbone finetuning. We train the whole backbone starting from Kinetics-400 [26] weights for the visual stream and from VGGSound [10] weights for the audio stream.
Modality variants. We train our model using different combinations of its modalities: audio only, visual only, and audio-visual. For the audio-visual case, we naively combine the losses by giving each the same weight (w_a = w_v = w_av).
Modality blending. To combine the per-modality losses in a more principled way, Wang et al. [49] proposed Gradient Blending (GB), a strategy to calculate the weight of each modality at training time. We use the offline Gradient-Blending algorithm to calculate the optimal weights w_a, w_v, and w_av. For further details on the offline GB procedure, please refer to Algorithms 1 and 2 of the original paper [49].
Distribution-Balanced Loss. To tackle datasets with multiple labels per sample that follow a long-tail distribution, Wu et al. [53] proposed a modification of the standard binary cross-entropy loss, which they call the Distribution-Balanced Loss (DB Loss). Since our dataset has these characteristics, we replace the base BCE loss in Equation (1) with the DB Loss. For further details, including the loss formulation, please refer to the original publication [53].
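As a concrete illustration of the weighted multi-modal objective in Equation (1), the sketch below implements the per-class BCE and the weighted sum in plain Python. The logits and targets are toy values; in the actual baselines, Gradient Blending tunes the weights and the DB Loss replaces the plain BCE:

```python
import math

def bce_multilabel(logits, targets):
    """Binary cross-entropy averaged over independent one-vs-all class outputs."""
    loss = 0.0
    for z, y in zip(logits, targets):
        p = 1.0 / (1.0 + math.exp(-z))  # sigmoid
        loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return loss / len(logits)

def audiovisual_loss(audio_logits, visual_logits, av_logits, targets,
                     w_a=1.0, w_v=1.0, w_av=1.0):
    """Weighted sum of per-modality losses, L = w_a*L_a + w_v*L_v + w_av*L_av.
    Equal weights give the naive combination; Gradient Blending would
    compute w_a, w_v, w_av instead."""
    return (w_a * bce_multilabel(audio_logits, targets)
            + w_v * bce_multilabel(visual_logits, targets)
            + w_av * bce_multilabel(av_logits, targets))
```

Each class is scored independently, so a clip can be predicted as, e.g., both a Reaction Cut and an L Cut, matching the multi-label nature of the task.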

4.2 Experimental Setting

Dataset summary. We divide our dataset into train, validation, and test splits. We report all experiments on the validation set unless otherwise mentioned.
Metrics. Following [53], we choose the mean Average Precision (mAP) across all classes to summarize and compare baselines’ performances. This metric helps to deal with our data’s multi-label nature. We also report the per-class AP.
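For reference, per-class AP and the resulting mAP over score rankings can be computed as in the minimal sketch below (evaluation code typically relies on a library implementation; this version breaks score ties by original order):

```python
def average_precision(scores, labels):
    """AP for one class: mean of the precision values at each positive,
    after ranking all samples by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(score_matrix, label_matrix):
    """mAP: average of the per-class APs over all cut types."""
    aps = [average_precision(s, l) for s, l in zip(score_matrix, label_matrix)]
    return sum(aps) / len(aps)
```

Because each class is ranked independently, mAP handles the multi-label setting naturally: a clip counts as a positive for every cut type it carries.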
Implementation details. For all our experiments, we use ResNet-18 [16] as the backbone for both the visual and audio streams. For the audio stream, we use a ResNet with 2D convolutions pre-trained on VGGSound [10]; this backbone takes as input a spectrogram of the audio signal and processes it as a regular image. For the visual stream, we use a ResNet-(2+1)D [43] pre-trained on Kinetics-400 [26]. We sample 16 frames from a Gaussian distribution centered around the cut as the input to the network. When using a single stream (audio only or visual only), we take the features after the average pooling, followed by an MLP composed of a fully-connected (FC) layer, a ReLU, and a final FC layer with one output per class (ten in total). When using two streams, we concatenate the features after the first FC layer of the MLP to obtain an audio-visual feature per clip, which we pass through a second FC layer to get the predictions. We train using SGD with momentum and weight decay, with a linear warm-up during the first epoch. We train for 8 epochs with an initial learning rate that decays by a factor of 10 after 4 epochs. We use an effective batch size of 80 and train on 4 NVIDIA V100 GPUs.

4.3 Results and Analysis

As described in Section 4.1, we benchmark our dataset using several combinations of modalities and strategies to combine them, along with a specialized loss function that deals with multi-label, long-tail-distributed datasets. Results are shown in Table 1.
Cutting Cut Cross Emphasis Match Smash Reaction L J Speaker
Model mAP on Action Away Cut Cut Cut Cut Cut Cut Cut Change
Linear Audio 23.92 36.71 14.79 11.83 14.57 1.47 10.73 65.34 15.29 18.44 50.02
Visual 28.78 53.85 36.26 16.95 19.41 1.13 13.29 69.69 12.95 16.22 48.01
Audio-Visual 30.82 55.51 32.78 16.09 20.31 1.66 13.26 73.68 17.37 21.58 55.99
Fine-tune Audio 26.18 40.84 17.09 13.80 15.00 1.45 13.07 67.81 17.46 20.74 54.49
Visual 41.35 61.49 58.39 33.21 28.69 1.42 20.38 78.94 29.26 36.67 65.01
Audio-Visual (AV) 44.74 61.49 58.15 31.25 29.64 1.75 22.55 81.89 42.90 45.56 72.22
AV + GB[49] 45.43 62.12 63.46 34.14 30.39 1.62 20.64 82.20 40.89 45.19 73.64
AV + GB[49] + DB Loss[53] 45.72 63.91 63.04 33.24 29.60 1.84 22.11 81.20 41.61 45.91 74.70
Table 1: Baselines. We show the results of our different baselines using visual, audio, and audio-visual modalities. The last two rows combine the audio-visual modality with Gradient Blending and the Distribution-Balanced Loss. All reported numbers are AP. We observe three key findings. (1) Fine-tuning on MovieCuts provides clear benefits over a linear classifier trained on frozen features. (2) Audio-visual information boosts the performance of visual-only streams. (3) Gradient Blending and the DB Loss push performance further by dealing with the task's multi-modal and multi-label nature.

Linear Classifier vs. Fine-tune: We evaluate the performance of frozen vs. fine-tuned features. As one might expect, fine-tuning the whole backbone shows consistent improvements across all classes regardless of the modality used. For instance, the audio-visual backbone's performance increases from 30.82 to 44.74 mAP. These results validate the value of the dataset for improving the audio and visual representations encoded by the backbones.
Modality Variants: Consistently across training strategies, we observe a common pattern in the results: the visual modality performs better at the task (41.35 mAP) than its audio counterpart (26.18 mAP). Nonetheless, combining both modalities still provides enhanced results for several classes and for the overall mAP (44.74). We observe that cuts driven mainly by visual cues, like Cutting on Action, Cut Away, and Cross Cut, do not improve when adding audio. However, the rest of the classes improve when using both modalities. In particular, L Cut, J Cut, and Speaker Change improve drastically, since these three types of cuts are naturally driven by audio-visual cues.
Gradient Blending: The second-to-last row in Table 1 shows the results of combining the three modalities with the GB weights w_a, w_v, and w_av. GB performs better (45.43 mAP) than combining the losses naively with equal weights (44.74 mAP). We find that the GB weights outperform all the loss-weight combinations we tested via grid search. Thus, we stick with the GB weights.
Distribution-Balanced Loss: The last row of Table 1 shows the combination of the Gradient Blending weights with the DB Loss. Even though the most represented classes do not improve, or even decrease slightly, the overall mAP reaches 45.72; the improvement thus comes from the least represented classes. These results reflect the DB Loss's behavior, since it was explicitly designed to deal with long-tail distributions and to balance the classes with the fewest samples. Overall, combining all the modalities with GB and the DB Loss gives the best results.
Window Sampling: In addition to these experiments, we explore how to pick the frames fed into the visual network. For all the previous experiments, we use Gaussian Sampling, placing a Gaussian window centered around the cut and adjusting its size based on the length of the clip. However, this is not the only strategy to sample frames. We explore two others: Uniform Sampling, which samples frames from a uniform distribution across the clip; and Fixed Sampling, which takes the 16 frames around the cut. We align the audio with the temporal span of the visual frames. We show the results of these three strategies in Table 2. The Gaussian strategy gives the best results (45.72 mAP). Gaussian Sampling allows us to sample densely around the cut while still including some context from the distribution's tails. These results suggest that the most critical information lies around the cut, but some context helps the model improve.

Model Sampling mAP
AV + GB[49] + DB Loss[53] Uniform 45.38
AV + GB[49] + DB Loss[53] Fixed 45.55
AV + GB[49] + DB Loss[53] Gaussian 45.72
Table 2: Window Sampling. We show different window sampling strategies to feed into the network.
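A minimal sketch of the Gaussian strategy, assuming frame indices in [0, clip_length) and a hypothetical `sigma_frac` parameter tying the window width to the clip length (the paper's exact setting is not specified here):

```python
import random

def sample_window(num_frames, cut_index, clip_length, sigma_frac=0.25, seed=0):
    """Draw `num_frames` frame indices from a Gaussian centered on the cut."""
    rng = random.Random(seed)
    sigma = max(1.0, sigma_frac * clip_length)
    indices = []
    for _ in range(num_frames):
        i = int(round(rng.gauss(cut_index, sigma)))
        indices.append(min(max(i, 0), clip_length - 1))  # clamp to valid range
    return sorted(indices)
```

Uniform Sampling would replace `rng.gauss(...)` with `rng.uniform(0, clip_length - 1)`, and Fixed Sampling would simply take the 16 indices adjacent to the cut.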
(a) Performance breakdown per attribute and dataset characteristics.
(b) Audio-visual improvements per cut type
Figure 6: Performance breakdown. We showcase a detailed performance analysis. Figure (a) shows the performance breakdown according to MovieCuts attributes such as type of sound, subjects, locations, number of labels, clip duration, and production year. Figure (b) shows the performance improvement over the visual-only model when using the audio-visual model. We group the classes into visually-driven and audio-visually-driven for the analysis.

Test Set Results. After obtaining the best-performing model on the validation set, we evaluate it on the test set. The test-set mAP is comparable to the validation-set results. For the full test-set results, please refer to the supplementary material.

Figure 7: Qualitative Results. We showcase three examples for Cutting on Action. The blue box indicates True Positives, while the red box indicates False Positives.

4.4 Performance Breakdown

Attributes and dataset characteristics. Figure 5(a) summarizes the performance of our best AV model (GB [49] + DB Loss [53]) across different attributes and dataset characteristics. In most cases, the model exhibits robust performance across attributes. The largest performance gap is observed between Speech and Other sounds. We associate this result with the fact that cuts with complex audio editing, those that include sound effects, often employ abstract editing such as Smash Cuts and Match Cuts, which are harder for the model to recognize. The model is also robust across dataset characteristics such as cut duration or the year the cut was produced. However, we notice that the model's performance drops significantly when there is a single label per cut. We observe that samples with multiple labels tend to belong to highly represented classes in the dataset, whereas single-label samples tend to belong to under-represented cut types, on which the model performs much worse. Thus, samples with multiple labels are recognized more accurately than samples with single labels, as shown in Figure 5(a). We hypothesize the DB Loss [53] pushes the model to assign high scores to more than one cut type. These two findings suggest there is room to push performance by studying better audio backbones and by taking a deep dive into the multi-label nature of MovieCuts.

Audio-visual improvements per cut type. Figure 5(b) shows the relative improvements of the audio-visual model over the visual-only stream. The figure highlights whether each type of cut is driven by visual or audio-visual information. First, we observe that, overall, the audio-visually-driven cuts benefit the most from training a joint audio-visual model. Second, the largest gains among the audio-visually-driven cuts are for those related to dialogue and conversations. For instance, L Cut's AP improves by 40%; this class typically involves an on-screen person talking in the first shot while only their voice is heard in the second shot. By encoding audio-visual information, the model disambiguates predictions that would otherwise be confusing for the visual-only model. Finally, all classes show relative improvements over the visual baseline. This suggests that the Gradient Blending [49] strategy allows the model to optimize the modalities' weights, achieving, in the worst case, slight improvements over the visual baseline. In short, we empirically demonstrate the importance of modeling audio-visual information to recognize cut types.

4.5 Qualitative Results

We showcase representative qualitative results for the Cut on Action class in Figure 7. The first two cuts are correctly classified as cutting on action, since the cut happens right after the action is performed (a gunshot and a boxing punch). The third example is a false positive: the model wrongly predicts it as a Reaction Cut. The model fails gracefully; the shot focuses on the actor’s face right before the action, which resembles what happens in Reaction Cuts. In the end, however, the actor is not reacting but performing an action across the cut. For more qualitative results, please refer to the supplementary material.

5 Conclusion

We introduced the task of cut type recognition and kick-started research in this new area by providing a new large-scale dataset accompanied by a benchmark of multiple audio-visual baselines. To construct the new dataset, we collected more than 170K annotations from qualified human workers. We analyzed the dataset’s diversity and uniqueness by studying its properties and audio-visual attributes. We proposed audio-visual baselines, including recent approaches that address the multi-modal and imbalanced nature of the problem. While we set a strong starting point for research, we hope that further work pushes the envelope of cut type recognition by leveraging MovieCuts.

Acknowledgments. This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.


  • [1] D. Arijon (1976) Grammar of the film language. Focal Press London. Cited by: Appendix B, §1, §3.1.
  • [2] M. Bain, A. Nagrani, A. Brown, and A. Zisserman (2020) Condensed movies: story based retrieval with contextual embeddings. External Links: 2005.04208 Cited by: §1, §2, §3.1.
  • [3] S. Benini, M. Svanera, N. Adami, R. Leonardi, and A. B. Kovács (2016) Shot scale distribution in art films. Multimedia Tools and Applications 75 (23), pp. 16499–16527. Cited by: §2, §3.1.
  • [4] P. Bojanowski, F. Bach, I. Laptev, J. Ponce, C. Schmid, and J. Sivic (2013) Finding actors and actions in movies. In Proceedings of the IEEE international conference on computer vision, pp. 2280–2287. Cited by: §2.
  • [5] D. Bordwell, K. Thompson, and J. Smith (1993) Film art: an introduction. Vol. 7, McGraw-Hill New York. Cited by: §3.1.
  • [6] X. Bost, S. Gueye, V. Labatut, M. Larson, G. Linarès, D. Malinas, and R. Roth (2019) Remembering winter was coming. Multimedia Tools and Applications 78 (24), pp. 35373–35399. Cited by: §2.
  • [7] X. Bost, V. Labatut, and G. Linares (2020) Serial speakers: a dataset of tv series. arXiv preprint arXiv:2002.06923. Cited by: §2.
  • [8] A. Brown, J. Huh, A. Nagrani, J. S. Chung, and A. Zisserman (2020) Playing a part: speaker verification at the movies. arXiv preprint arXiv:2010.15716. Cited by: §1.
  • [9] L. Canini, S. Benini, and R. Leonardi (2013) Classifying cinematographic shot types. Multimedia tools and applications 62 (1), pp. 51–73. Cited by: §1, §2, §3.1.
  • [10] H. Chen, W. Xie, A. Vedaldi, and A. Zisserman (2020) VGGSound: a large-scale audio-visual dataset. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Cited by: §3.3, §4.1, §4.2.
  • [11] J. E. Cutting (2016) The evolution of pace in popular movies. Cognitive research: principles and implications 1 (1), pp. 30. Cited by: §1.
  • [12] O. Duchenne, I. Laptev, J. Sivic, F. Bach, and J. Ponce (2009) Automatic annotation of human actions in video. In 2009 IEEE 12th International Conference on Computer Vision, pp. 1491–1498. Cited by: §2.
  • [13] M. Everingham, J. Sivic, and A. Zisserman (2006) Hello! my name is… buffy”–automatic naming of characters in tv video.. In BMVC, Vol. 2, pp. 6. Cited by: §1.
  • [14] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. (2018) Ava: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056. Cited by: §1, §2.
  • [15] M. Gygli (2018) Ridiculously fast shot boundary detection with fully convolutional neural networks. In 2018 International Conference on Content-Based Multimedia Indexing (CBMI), pp. 1–4. Cited by: §3.1.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.2.
  • [17] M. Hesham, B. Hani, N. Fouad, and E. Amer (2018) Smart trailer: automatic generation of movie trailer using only subtitles. In 2018 First International Workshop on Deep and Representation Learning (IWDRL), pp. 26–30. Cited by: §2.
  • [18] M. Hoai and A. Zisserman (2014) Thread-safe: towards recognizing human actions across shot boundaries. In Asian Conference on Computer Vision, pp. 222–237. Cited by: §1, §2.
  • [19] Q. Huang, W. Liu, and D. Lin (2018) Person search in videos with one portrait through visual and temporal links. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 425–441. Cited by: §2.
  • [20] Q. Huang, Y. Xiong, and D. Lin (2018-06) Unifying identification and context learning for person recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [21] Q. Huang, Y. Xiong, A. Rao, J. Wang, and D. Lin (2020) MovieNet: a holistic dataset for movie understanding. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2.
  • [22] Q. Huang, Y. Xiong, Y. Xiong, Y. Zhang, and D. Lin (2018) From trailers to storylines: an efficient way to learn from movies. arXiv preprint arXiv:1806.05341. Cited by: §2.
  • [23] Q. Huang, L. Yang, H. Huang, T. Wu, and D. Lin (2020) Caption-supervised face recognition: training a state-of-the-art face model without manual annotation. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [24] G. Irie, T. Satou, A. Kojima, T. Yamasaki, and K. Aizawa (2010) Automatic trailer generation. In Proceedings of the 18th ACM international conference on Multimedia, pp. 839–842. Cited by: §2.
  • [25] E. Katz and F. Klein (2005) The film encyclopedia. Collins. Cited by: §1.
  • [26] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §4.1, §4.2.
  • [27] A. K. Kozlovic (2007) Anatomy of film. Kinema: A Journal for Film and Audiovisual Media. Cited by: §1.
  • [28] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld (2008) Learning realistic human actions from movies. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §1, §2.
  • [29] X. Liu, Y. Hu, S. Bai, F. Ding, X. Bai, and P. H. Torr (2020) Multi-shot temporal event localization: a benchmark. arXiv preprint arXiv:2012.09434. Cited by: §1, §2.
  • [30] T. Maharaj, N. Ballas, A. Rohrbach, A. Courville, and C. Pal (2017) A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6884–6893. Cited by: §2.
  • [31] A. Nagrani and A. Zisserman (2018) From benedict cumberbatch to sherlock holmes: character identification in tv series without a script. arXiv preprint arXiv:1801.10442. Cited by: §1.
  • [32] G. Pavlakos, J. Malik, and A. Kanazawa (2020) Human mesh recovery from multiple shots. arXiv preprint arXiv:2012.09843. Cited by: §2.
  • [33] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: Appendix B, Appendix B, §3.3, §3.3.
  • [34] A. Rao, J. Wang, L. Xu, X. Jiang, Q. Huang, B. Zhou, and D. Lin (2020) A unified framework for shot type classification based on subject centric lens. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2.
  • [35] A. Rao, L. Xu, Y. Xiong, G. Xu, Q. Huang, B. Zhou, and D. Lin (2020) A local-to-global approach to multi-modal movie scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10146–10155. Cited by: §2.
  • [36] A. Rohrbach, A. Torabi, M. Rohrbach, N. Tandon, C. Pal, H. Larochelle, A. Courville, and B. Schiele (2017) Movie description. International Journal of Computer Vision 123 (1), pp. 94–120. Cited by: §1, §2.
  • [37] J. Sivic, M. Everingham, and A. Zisserman (2009) “Who are you?”-learning person specific classifiers from video. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1145–1152. Cited by: §2.
  • [38] J. R. Smith, D. Joshi, B. Huet, W. Hsu, and J. Cota (2017) Harnessing ai for augmenting creativity: application to movie trailer creation. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1799–1808. Cited by: §2.
  • [39] T. J. Smith and J. M. Henderson (2008) Edit blindness: the relationship between attention and global change blindness in dynamic scenes.. Journal of Eye Movement Research 2 (2). Cited by: Figure 1.
  • [40] T. J. Smith, D. Levin, and J. E. Cutting (2012) A window on reality: perceiving edited moving images. Current Directions in Psychological Science 21 (2), pp. 107–113. Cited by: Figure 1, §3.1.
  • [41] T. J. Smith (2012) The attentional theory of cinematic continuity. Projections 6 (1), pp. 1–27. Cited by: Appendix B.
  • [42] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler (2016) MovieQA: Understanding Stories in Movies through Question-Answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
  • [43] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §4.2.
  • [44] Note: Cited by: §3.1.
  • [45] Note: Cited by: §3.1.
  • [46] Y. Tsivian (2009) Cinemetrics, part of the humanities’ cyberinfrastructure. Cited by: §1, §1.
  • [47] P. Vicol, M. Tapaswi, L. Castrejon, and S. Fidler (2018) MovieGraphs: towards understanding human-centric situations from videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
  • [48] H. L. Wang and L. Cheong (2009) Taxonomy of directing semantics for film shot classification. IEEE transactions on circuits and systems for video technology 19 (10), pp. 1529–1542. Cited by: §2, §3.1.
  • [49] W. Wang, D. Tran, and M. Feiszli (2020) What makes training multi-modal classification networks hard?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12695–12705. Cited by: Table 3, Table 4, Appendix C, §1, §4.1, §4.4, §4.4, Table 1, Table 2.
  • [50] H. Wu and M. Christie (2016) Analysing cinematography with embedded constrained patterns. In WICED-Eurographics Workshop on Intelligent Cinematography and Editing, Cited by: §2.
  • [51] H. Wu, Q. Galvane, C. Lino, and M. Christie (2017) Analyzing elements of style in annotated film clips. In WICED 2017-Eurographics Workshop on Intelligent Cinematography and Editing, pp. 29–35. Cited by: §1, §2.
  • [52] H. Wu, F. Palù, R. Ranon, and M. Christie (2018) Thinking like a director: film editing patterns for virtual cinematographic storytelling. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (4), pp. 1–22. Cited by: §1, §2.
  • [53] T. Wu, Q. Huang, Z. Liu, Y. Wang, and D. Lin (2020) Distribution-balanced loss for multi-label classification in long-tailed datasets. In European Conference on Computer Vision, pp. 162–178. Cited by: Table 3, Table 4, Appendix C, §1, §4.1, §4.2, §4.4, Table 1, Table 2.
  • [54] J. Xia, A. Rao, L. Xu, Q. Huang, J. Wen, and D. Lin (2020) Online multi-modal person search in videos. In The European Conference on Computer Vision (ECCV), Cited by: §2.
  • [55] Y. Xiong, Q. Huang, L. Guo, H. Zhou, B. Zhou, and D. Lin (2019-10) A graph-based framework to bridge movies and synopses. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
  • [56] H. Xu, Y. Zhen, and H. Zha (2015) Trailer generation via a point process-based visual attractiveness model. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §2.

Supplementary Material

Please visit: for code and complete supplementary material.

Appendix A MovieCuts Annotation Consensus

To analyze the quality of MovieCuts’ annotations, we run an additional labeling campaign and measure the consensus between two disjoint sets of human annotators. To this end, we sample clips randomly from the initially annotated set. After receiving the annotations, we compute the number of responses that agree with the first campaign. We notice that about of the time, both sets of workers agreed on the labels assigned to each candidate cut. While this can be seen as a high annotation quality ratio, we believe there is room for improvement to reduce the rate of missing labels. A potential solution for future annotation campaigns would be to increase the number of workers required to annotate each cut, which is currently set to three.
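The consensus measurement above amounts to comparing the label sets produced by the two campaigns for the same clips. A minimal sketch of that computation (our own helper, assuming each campaign's output is a mapping from clip id to a set of cut-type labels) could look like:

```python
def label_agreement(campaign_a, campaign_b):
    """Fraction of clips whose label sets match exactly across two campaigns.

    campaign_a / campaign_b: dict mapping clip id to a set of cut-type
    labels. Only clips annotated in both campaigns are compared.
    """
    shared = campaign_a.keys() & campaign_b.keys()
    if not shared:
        return 0.0
    agree = sum(1 for cid in shared if campaign_a[cid] == campaign_b[cid])
    return agree / len(shared)
```

Looser criteria (e.g., any-label overlap instead of exact set equality) would give an upper bound on agreement, which is useful when diagnosing missing rather than conflicting labels.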

Appendix B MovieCuts Attributes Details

We leverage CLIP [33] to extract visual attributes from all instances in MovieCuts, following the zero-shot setup described in [33]. To do so, we create language queries using relevant classes for each attribute type as a set of candidate text-visual pairs, and use CLIP’s dual encoder to predict the most probable pair (i.e., the most probable tag). Thus, we compute an image embedding for the visual frames and a text embedding for all candidate text queries (attribute tags), and then compute the cosine similarity between the L2-normalized embedding pairs. Instead of simply passing the tags to the language encoder, we augment the text queries using the following templates: “a photo of a {subject attribute}” and “a {location attribute} photo” for the subject and location attributes, respectively. For each cut, we retrieve tags for each of its two shots by sampling a random frame before and after the shot transition.
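The retrieval step described above reduces to a cosine-similarity argmax over L2-normalized embeddings. The sketch below shows only that step, assuming the image and text embeddings have already been computed (e.g., with CLIP's two encoders); the function name is ours.

```python
import numpy as np

def most_probable_tag(image_embedding, tag_embeddings, tags):
    """Pick the tag whose text embedding is closest to the image embedding.

    All embeddings are L2-normalized first, so the dot product equals
    cosine similarity, as in CLIP's zero-shot setup.
    """
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = tag_embeddings / np.linalg.norm(tag_embeddings, axis=1, keepdims=True)
    sims = txt @ img                      # one similarity per candidate tag
    return tags[int(np.argmax(sims))]
```

In our setup this is run twice per cut, once on a random frame from each side of the shot transition, yielding one tag per shot.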

Actions that trigger cuts. Our goal is to find correlations between action tags and cut types. To do so, we first build a zero-shot action classifier based on CLIP [33]. Since the zero-shot action classifier did not offer high accuracy, we limit our analysis to the most confident tags only. Such tags allow us to find the most common co-occurrences between actions and cut types. Figure 8 showcases three common action/cut pairs. These patterns are common across different movie scenes and editors’ styles. These empirical findings reaffirm film grammar theory [1, 41], which suggests that video editing follows a set of rules more often than not.
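Once each cut carries a confident action tag and an annotated cut type, finding the dominant pairs is a straightforward counting exercise. A minimal sketch (our own helper, not the paper's analysis code):

```python
from collections import Counter

def top_cooccurrences(samples, k=3):
    """Return the k most common (action tag, cut type) pairs.

    samples: iterable of (action_tag, cut_type) tuples, e.g. the most
    confident zero-shot action tag paired with the annotated cut type.
    """
    return Counter(samples).most_common(k)
```

In practice, normalizing each pair's count by the marginal frequency of the action tag would separate genuinely correlated pairs from pairs that are common simply because both members are common.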

Figure 8: Actions that trigger cuts. Some actions predominantly co-occur with a particular cut type. For instance, a Talking on the phone action is often edited via the Cross Cut and J Cuts. Another common pattern emerges when someone is Holding a gun. The predominant edit is the Reaction Cut, which first shows an actor holding the gun, and the next shot highlights a face reaction of another subject.

Appendix C Additional Results

Model                       Sampling   mAP    Cutting    Cut    Cross  Emphasis  Match  Smash  Reaction  L      J      Speaker
                                              on Action  Away   Cut    Cut       Cut    Cut    Cut       Cut    Cut    Change
AV + GB[49] + DB Loss[53]   Uniform    45.38  61.82      61.38  30.42  29.91     1.28   21.10  81.95     42.97  46.59  76.39
AV + GB[49] + DB Loss[53]   Fixed      45.55  62.66      60.25  32.11  28.73     1.86   21.82  81.51     42.39  48.60  75.51
AV + GB[49] + DB Loss[53]   Gaussian   45.72  63.91      63.04  33.24  29.60     1.84   22.11  81.20     41.61  45.91  74.70
Table 3: Window Sampling Results. We compare different strategies for sampling the input window fed into the network. All per-class numbers are AP.
Model                       Sampling   mAP    Cutting    Cut    Cross  Emphasis  Match  Smash  Reaction  L      J      Speaker
                                              on Action  Away   Cut    Cut       Cut    Cut    Cut       Cut    Cut    Change
Audio                       Gaussian   25.36  41.85      16.15  12.72  14.14     1.23   12.37  67.01     15.38  18.77  54.04
Visual                      Gaussian   41.27  62.01      59.02  29.59  29.52     1.89   20.21  78.89     31.47  36.03  64.04
Audio-Visual (AV)           Gaussian   44.18  61.46      53.17  31.27  30.19     1.70   24.47  81.94     39.76  43.57  74.22
AV + GB[49]                 Gaussian   44.95  62.82      61.80  30.78  29.72     1.69   23.12  81.85     39.90  44.00  73.81
AV + GB[49] + DB Loss[53]   Gaussian   45.32  63.90      62.38  30.96  29.33     2.15   23.93  81.04     39.95  45.14  74.45
Table 4: Test Set Results. We show the performance of the fine-tuned models from Table 1 of the main paper, evaluated on the test set. The summary number is mAP; per-class numbers are AP.

We showcase the per-class results of the window sampling study in Table 3.
In addition, Figure 9 shows the PR curves for our best model on the validation set as an additional metric to those reported in the main manuscript. We observe the precision and recall values at different confidence thresholds for each class in MovieCuts.
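The AP values reported throughout these tables are the area under exactly these precision-recall curves. For reference, a minimal self-contained implementation (equivalent to the standard "mean precision at each true-positive rank" definition; not the paper's evaluation code) is:

```python
import numpy as np

def average_precision(y_true, y_score):
    """AP as the mean of precision values at each true-positive rank."""
    order = np.argsort(-np.asarray(y_score))   # rank by descending score
    y = np.asarray(y_true)[order]
    hits = np.cumsum(y)                        # true positives so far
    ranks = np.arange(1, len(y) + 1)
    precisions = hits / ranks                  # precision at every rank
    return float(precisions[y == 1].mean())    # average at positive ranks

# mAP is then the mean of per-class APs over the ten cut types.
```

Sweeping a confidence threshold over `y_score` instead of averaging yields the full PR curve of Figure 9.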

Figure 9: Precision vs Recall curves for our best model on the validation set.

Additionally, in Table 4 we present the most important baselines evaluated on the test set. We notice that the same analysis made on the validation set holds on the test set. The visual signal is stronger than the audio signal; however, combining them drastically improves some classes. Moreover, Gradient Blending [49] improves over the naïve combination of modalities, and the DB Loss [53] improves the results even further. The best model on both the validation and test sets is obtained by combining all of these techniques.

Finally, the attached slides contain several sample cuts per category and additional qualitative results.

Appendix D Additional Statistics

Figure 10 summarizes the difference in label distributions across genres. To build this visualization, we first calculate the average number of cuts for each class; then, we plot the standard deviation from the class mean for each genre. Thus, we visualize how frequent or infrequent each class is depending on the movie genre. For instance, we observe that in genres like Romance and Drama, the classes Speaker Change, J Cut, and L Cut are more frequent than in Action and Adventure movies. Conversely, in Action and Adventure, Cross Cuts and Cuts on Action are more frequent than in Romance and Drama.
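The per-genre visualization described above is a per-class standardization (z-score) of the genre-by-class count matrix. A minimal sketch of that computation, under the assumption that the counts are arranged with one row per genre and one column per cut type:

```python
import numpy as np

def genre_deviation(counts):
    """Per-genre deviation of each cut type from its class mean, in std units.

    counts: (G, C) matrix of average cut counts, one row per genre and one
    column per cut type. Columns with zero variance map to zero deviation.
    """
    mean = counts.mean(axis=0)
    std = counts.std(axis=0)
    std[std == 0] = 1.0            # avoid division by zero for flat classes
    return (counts - mean) / std
```

Positive entries mark cut types that are over-represented in a genre relative to the dataset-wide average; negative entries mark under-represented ones.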

Figure 10: Summary of the cut-types’ distribution per movie genre.

In addition to Figure 10, Figure 11 shows the class distribution for the most represented genres across the different splits: train 11(a), validation 11(b), and test 11(c). We see that the distributions across splits are independent and identically distributed (iid).

(a) Train
(b) Validation
(c) Test
Figure 11: Cut-type distribution across movie genres and MovieCuts splits