Professionally edited movies follow what is known as film grammar: a set of conventions that govern the structure of a film. In this grammar, a scene is a section of a motion picture set in a single location and continuous time, made up of a series of shots. A shot is a set of contiguous frames captured by an individual camera from a particular angle. Finally, a shot transition, or cut, is the most basic video editing device: it makes the transition from one shot to another [27, 46]. Cuts create changes of perspective, highlight emotions, and help advance stories [27, 46]. Consequently, and owing to advances in digital editing, their usage in movies has steadily increased over time. Cuts in professionally edited movies are not random; rather, they have a language, structures, and a taxonomy. Each cut in a movie has a purpose and a specific meaning. Thus, to understand movie content, one has to understand the cuts. Tsivian et al. introduced a platform called Cinemetrics
to analyze movies through their cut frequency. However, with movies containing dozens of cuts per scene, analyzing and understanding those cuts is a difficult task. Despite their importance in editing, widespread adoption, and meaning in movies, research to understand creative cuts remains unexplored in the computer vision community. We argue that automatically recognizing and understanding cut types would be a step towards decoding video editing's core principles. Furthermore, recognizing cut types can enable new applications in the video editing industry, such as movie analysis for education, video re-editing, virtual cinematography, and machine-assisted trailer generation, among others.
Figure 1 illustrates the cut type recognition task introduced in this work. A cut is composed of two adjacent shots and the transition between them. Cuts are made not only of visual frames but also of their time-aligned sound stream; in many situations, the sound drives the cut and shapes its meaning. Our goal is then to recognize the cut type by analyzing the clip's audio-visual information across shots. Multiple research challenges emerge from this new multi-shot video understanding task. First, identifying the intended cut type requires a high-level understanding of visual and audio relationships over time. For instance, to recognize the J Cut in example 1, one needs a detailed audio-visual inspection to associate the sound in Shot B with the actor's voice in Shot A. The Reaction Cut in Figure 1, by contrast, mainly requires a visual understanding of facial expressions. Furthermore, a cut may carry multiple labels that require orthogonal analyses to make correct predictions. We argue these challenges can promote the development of new architectures that address the multi-modal and multi-shot nature of the problem.
Understanding the audio-visual properties of movies has a long-standing history of interest [28, 14, 36, 47]. The community has developed methods to recognize characters and speakers [31, 13, 8], events and actions [29, 14, 18], and story-lines [2, 47, 21], to infer shot-level cinematography properties such as shot scale and camera motion [34, 9], and to mine shot-sequencing patterns [52, 51, 46]. While these approaches have set an initial framework for understanding editing in movies, there is still a lack of automated tools that understand the most basic and most used editing technique: the cut.
This work aims to study and bootstrap research in cut type recognition. To do so, we introduce MovieCuts, a new large-scale dataset with manually curated, multi-label cut type annotations. Our dataset contains clips (each with a cut) labeled across ten different categories. We surveyed professional editors to define our taxonomy and employed qualified annotators to label the cut type categories. MovieCuts offers the opportunity to benchmark core research problems such as multi-modal analysis, learning from long-tailed distributions, and multi-label classification. We benchmark a variety of audio-visual baselines. While we observe improvements from leveraging recent techniques for audio-visual blending and long-tailed multi-label learning, there is ample room for improvement, and the task remains an open research problem.
Contributions. Our quest is to bootstrap research in cut type understanding. Our contributions are threefold:
(1) We introduce the cut type recognition task. To the best of our knowledge, our work is the first to address and formalize this task from a machine learning perspective.
(2) We collect a large-scale dataset containing qualified human annotations that verify the presence of different cut types. We extensively analyze the dataset to highlight its properties and the challenges it poses. We call this dataset MovieCuts (Section 3).
(3) We implement multiple audio-visual baselines to establish a benchmark in cut type recognition (Section 4).
2 Related Work
Edited Content in Video Understanding. Edited video content such as movies has been a rich source of data for general video understanding. Such videos contain the varied human actions, objects, and situations of daily life. In the early stages of action recognition, Laptev et al. proposed the Hollywood Human Actions (HOHA) dataset, which contains short clips from 32 Hollywood movies with annotations for human-action recognition. Another group of works used a limited number of films to train and test methods for character recognition, human action localization, event localization, and spatio-temporal action and character localization. With the development of deep-learning techniques and the need for large-scale datasets to train deep models, Gu et al. proposed the AVA dataset, a large-scale dataset with spatio-temporal annotations of actors and actions, whose primary data sources are movies and TV shows. Furthermore, other works have focused on action and event recognition across shots [29, 18]. Finally, Pavlakos et al. leverage information across shots from TV shows to perform human mesh reconstruction. Instead of leveraging movie data to learn representations for traditional tasks, we propose a new task to analyze movie cut types automatically.
Stories, Plots, and Cinematography. Movies and TV shows are rich in complexity and content, which makes their analysis and understanding a challenging task. Movies are a naturally multi-modal source of data, with audio, video, and often even transcripts available. Several recent works focus on understanding movie content, including movie trailer creation [24, 17, 56, 38] and TV-show summarization [6, 7]. Besides, Vicol et al. proposed MovieGraphs, a dataset that uses movies to analyze human-centric situations. Rohrbach et al. presented the Movie Description dataset, which contains audio narratives and movie scripts aligned to full-length movies. Using this dataset, the Large Scale Movie Description Challenge (LSMDC) has hosted competitions for a variety of tasks, including Movie Fill-In-The-Blank and movie Q&A, among others. Like LSMDC, MovieNet and Condensed Movies are big projects that contain several tasks, data, and annotations related to movie understanding. MovieNet includes works on person re-identification [23, 54, 19, 20], movie scene temporal segmentation, and trailer and synopsis analysis [55, 22]. All these works have shown that movies carry rich information about human actions, with their own specific challenges. However, only a few have focused on movies' artistic aspects, such as shot scales [34, 9, 3], shot taxonomy and classification, and editing structure and cinematography [50, 51, 52]. These studies set the foundations for analyzing movies' editing properties but missed one of the most used techniques: the cut. Understanding cuts is crucial for decoding the grammar of the film language, and our work bootstraps research towards that goal.
3 The MovieCuts Dataset
3.1 Building MovieCuts
Taxonomy of Cuts. Our goal is to find a set of cut categories often used in movie editing. Although there exists literature on the grammar of the film language [1, 40] and on the taxonomy of shot types [48, 9, 3], there is no gold-standard categorization of cuts. To cope with this lack of taxonomy, we surveyed a group of editors who helped us formalize a set of cut types. We gathered an initial taxonomy (17 cut types) from film-making courses, textbooks, and blogs. We then hired ten different editors to validate the taxonomy. All the editors studied film-making, two have been nominated for Emmys, and most have over 10 years of experience. Some of the original categories were duplicates, and some (like Jump Cuts) were challenging to mine from movies. Our final taxonomy includes ten categories. Figure 2 illustrates each cut type along with its visual and audio signals.
We give a brief description of each of the cut types in MovieCuts:
1. Cutting on Action: cutting from one shot to another while the subject is still in motion.
2. Cross Cut: cutting back and forth between locations.
3. Emphasis Cut: a cut from wide to close within the same shot, or the other way around.
4. Cut Away: cutting to an insert shot of something and then back.
5. Reaction Cut: a cut to a subject's reaction (facial expression or single word) to another actor's comments or actions, or a cut right after the reaction.
6. Speaker Change: a cut that changes the shot to the current speaker.
7. J Cut: the audio from the next shot begins before its visuals appear; you hear what is going on before you see it.
8. L Cut: the audio from the current shot carries over into the next shot.
9. Smash Cut: an abrupt cut from one shot to another for aesthetic, narrative, or emotional purposes.
10. Match Cut: a cut from one shot to another that matches a concept, an action, or the composition.
Video Collection and Processing. We need professionally edited videos containing diverse types of cuts, and movies are a perfect source for such data. As pointed out in prior work, there are online video channels (such as the MovieClips YouTube channel, the source of our videos) that distribute movie scenes to the public, which facilitates access to movie data for research. We downloaded scenes spanning a large number of movies. However, these movie scenes come untrimmed, and further processing is required to obtain cuts from them. We automatically detect all the cuts in the dataset with a highly accurate shot detector (high precision and recall), which yields the pool of candidate cuts for annotation.
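To illustrate the shot-boundary detection step, the sketch below detects hard cuts with a naive color-histogram difference between consecutive frames. The actual pipeline uses a CNN-based shot detector, so the `color_histogram`/`detect_cuts` helpers and the threshold below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def color_histogram(frame, bins=8):
    """Per-channel color histogram of an HxWx3 uint8 frame, L1-normalized."""
    hists = [np.histogram(frame[..., c], bins=bins, range=(0, 255))[0]
             for c in range(frame.shape[-1])]
    h = np.concatenate(hists).astype(float)
    return h / h.sum()

def detect_cuts(frames, threshold=0.5):
    """Return indices i where a (hard) cut occurs between frame i and i+1."""
    hists = [color_histogram(f) for f in frames]
    cuts = []
    for i in range(len(hists) - 1):
        # L1 distance between consecutive histograms; a large jump
        # suggests a shot boundary
        if np.abs(hists[i] - hists[i + 1]).sum() > threshold:
            cuts.append(i)
    return cuts
```

A histogram detector misses gradual transitions (fades, dissolves), which is one reason learned detectors are preferred in practice.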
Human Annotations and Verification. Our goal at this stage is to collect expert annotations for the candidate cuts. To do so, we hired Hive AI to run our annotation campaign, chosen for their experience labeling data for the entertainment industry. Annotators did not necessarily have a film-making background, but they had to pass a qualification test to participate in the labeling process. At least three annotators reviewed each cut/label candidate pair, and only annotations with more than two votes were kept. Annotators also had the option to discard cuts due to: (i) errors in the shot boundary detector, or (ii) the cut not showing any of the categories in our taxonomy. We also built a handbook in partnership with the professional editors that includes several examples per class and guidelines for addressing edge cases. After discarding such cuts, we were left with the final set of cuts that form our dataset. To validate annotation quality, we launched a second campaign to re-annotate two thousand cuts with a different pool of workers; we found that the ratio of annotation errors between the two campaigns was low. This inter-coder agreement is only computed for samples that passed our consensus filter; a portion of the discarded clips did not reach enough consensus relative to the initial pool of candidate cuts. Note that the inter-coder agreement does not account for missing labels. We thus invited five professional editors to label the same two thousand cuts to create a high-consensus ground truth, against which we measured the precision and recall of the original annotations.
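The vote-based filter described above can be sketched as follows. The threshold is ambiguous in prose ("more than two votes" across at least three annotators), so `min_votes` is left as an adjustable assumption rather than the paper's exact rule.

```python
from collections import Counter

def consensus_labels(annotations, min_votes=2):
    """Keep labels voted for by at least `min_votes` annotators.

    `annotations` is a list of label sets, one set per annotator,
    all referring to the same candidate cut.
    """
    counts = Counter(label for ann in annotations for label in set(ann))
    return {label for label, votes in counts.items() if votes >= min_votes}
```

For example, if two of three annotators mark a clip as both Reaction Cut and L Cut, both labels survive the filter; labels chosen by a single annotator are dropped.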
Please refer to the supplementary material for further details about: (i) examples of each cut type; (ii) additional information about the annotation consensus.
3.2 MovieCuts Statistics
Label distribution. Figure 2(a) shows the distribution of cut types in MovieCuts. The distribution is long-tailed (Zipf's law), which is natural and reflects editors' preferences for certain types of cuts. It is no surprise that Reaction Cut is the most abundant label, given that emotion and human reactions play a central role in storytelling. Beyond human emotion, dialogue and human actions are additional key components for advancing video stories. This is reflected in the label distribution, where Speaker Change and Cutting on Action are the second and third most frequent categories in the dataset. While classes such as Smash Cut and Match Cut appear scarcely in the dataset, it is still important to recognize these types of cuts, which are arguably the most creative ones.
Multilabel distribution and co-occurrences. We plot in Figure 2(b) the distribution of labels per cut and the co-occurrence matrix. On the one hand, we observe that a significant number of cuts contain more than one label. On the other hand, we observe that certain pairs of classes co-occur more often, e.g., Reaction Cut / L Cut. The multi-label properties of the dataset suggest that video editors compose and combine cut types quite often.
Duration of shot pairs. We define the cut duration as the duration of the shot pair that forms the cut. Figure 2(c) shows the distribution of cut durations. The most typical cut duration is about 3.5 seconds. Moreover, we observe that the length of cuts ranges from 2 seconds to more than 30 seconds. While some cut types such as Match Cut would not benefit from analyzing a large temporal context, most cuts would benefit from leveraging extended time spans.
We also gather statistics about the productions from which the cuts were sampled. First, the cuts were sampled across a diverse set of genres, with Comedy being the most frequent. Second, we sourced the cuts from both old and contemporary movie scenes: while many cuts come from the last decade, we scouted cuts from movie scenes as far back as the 1930s. Finally, we observe that the number of cuts per scene roughly follows a normal distribution with a mean of 15 cuts per scene. Interestingly, a few movie scenes have a single cut, while others contain more than 60 cuts. These statistics highlight the editing diversity from which we mined the cuts.
3.3 MovieCuts Attributes
Sound Attributes. We leverage an off-the-shelf audio classifier to annotate the sound events in the dataset; Figure 3(a) summarizes the distribution of three super-groups of sound events: Speech, Music, and Other. Dialogue-related cuts such as Speaker Change and J Cut contain a large amount of speech. Conversely, visually driven cuts such as Match Cut and Smash Cut contain more varied sounds and background music. These attributes suggest that while analyzing speech is crucial for recognizing cut types, it is also beneficial to model music and general sounds.
Subject Attributes. We build a zero-shot classifier using CLIP to tag the subjects present in our dataset samples (Figure 3(b)). Interestingly, dialogue- and emotion-driven cuts (e.g., Reaction Cut) contain many Face tags, which can be interpreted as humans framed in medium-to-close-up shots. Conversely, Body is the most common attribute in the Cutting on Action class, which suggests editors often opt for framing humans in long shots when actions are happening.
Location Attributes. We reuse CLIP to construct a zero-shot classifier of locations, which we apply to tag the dataset. Figure 3(c) summarizes the distribution of locations (only Interior/Exterior) per cut type. On the one hand, we observe that most cut types contain instances shot in Interior locations 60%-70% of the time. On the other hand, Match Cuts reverse this trend, with the majority (53%) of the cuts shot in Exterior places. The obtained distribution suggests that stories in movies (as in real life) unfold more often in indoor places.
4.1 Audiovisual Baseline
Our base architecture is shown in Figure 5. It takes as input the audio signal and a set of frames, which are processed by a Siamese CNN to extract per-clip features. We combine these features and use a Multi-Layer Perceptron (MLP) to produce the final predictions. To deal with the problem's multi-label nature, we optimize a binary cross-entropy (BCE) loss per modality (audio, visual, and audio-visual) in a one-vs-all manner. Our loss is summarized as:

$\mathcal{L} = w_{a}\,\mathcal{L}_{a} + w_{v}\,\mathcal{L}_{v} + w_{av}\,\mathcal{L}_{av}$,    (1)
where $w_{a}$, $w_{v}$, and $w_{av}$ are the weights for the audio, visual, and audio-visual losses, respectively. Using this architecture, we propose several baselines:
Linear classifier. We extract features for each stream and train a linear classifier on them and on their concatenation.
Fine-tuning. We train the whole backbone, starting from Kinetics-400 weights for the visual stream and VGGSound weights for the audio stream.
Modality variants. We train our model using different combinations of its modalities: audio only, visual only, and audio-visual. For the audio-visual case, we combine the losses naively, giving each of them the same weight.
Modality blending. To combine the per-modality losses in a more principled way, Wang et al. proposed Gradient Blending (GB), a strategy to calculate the weight of each modality at training time. We use the offline Gradient-Blending algorithm to calculate the optimal weights $w_{a}$, $w_{v}$, and $w_{av}$. For further details of the offline GB procedure, please refer to Algorithms 1 and 2 of the original paper.
Distribution-Balanced Loss. To tackle datasets with multiple labels per sample that follow a long-tailed distribution, Wu et al. proposed a modification to the standard binary cross-entropy loss called the Distribution-Balanced Loss (DB Loss). Since our dataset fits these characteristics, we replace the base BCE loss in Equation (1) with the DB Loss. For further details, including the loss formulation, please refer to the original publication.
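The pieces above can be sketched together: a one-vs-all BCE per modality head, the weighted combination of Equation (1), and a rough offline estimate of the Gradient-Blending weights (weight proportional to generalization over squared overfitting, following Wang et al.). All function names are illustrative, and `gb_weights` condenses the original two-checkpoint measurement into a minimal form; treat it as a sketch rather than the exact algorithm.

```python
import numpy as np

def bce_with_logits(logits, targets, eps=1e-12):
    """One-vs-all binary cross-entropy, averaged over samples and classes."""
    p = 1.0 / (1.0 + np.exp(-logits))
    return float(-np.mean(targets * np.log(p + eps)
                          + (1 - targets) * np.log(1 - p + eps)))

def combined_loss(losses, weights):
    """Equation (1): L = w_a*L_a + w_v*L_v + w_av*L_av over the heads."""
    return sum(weights[m] * losses[m] for m in losses)

def gb_weights(train_losses, val_losses, eps=1e-8):
    """Offline Gradient-Blending estimate: weight proportional to G / O^2.

    For each modality head, `train_losses[m]` and `val_losses[m]` hold the
    loss at two checkpoints. G is the validation-loss improvement
    (generalization); O is the growth of the train/val gap (overfitting).
    """
    raw = {}
    for m in train_losses:
        t0, t1 = train_losses[m]
        v0, v1 = val_losses[m]
        G = v0 - v1                      # generalization gain
        O = (v1 - t1) - (v0 - t0)        # overfitting growth
        raw[m] = max(G, 0.0) / (O * O + eps)
    z = sum(raw.values())
    return {m: w / z for m, w in raw.items()}
```

With equal weights, `combined_loss` reduces to the naive sum used by the "Modality variants" baseline; `gb_weights` returns normalized weights that down-weight heads whose train/val gap grows fastest.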
4.2 Experimental Setting
Dataset summary. We divide our dataset into train, validation, and test splits. We report all experiments on the validation set unless otherwise mentioned.
Metrics. Following prior work, we use the mean Average Precision (mAP) across all classes to summarize and compare the baselines' performance; this metric suits our data's multi-label nature. We also report the per-class AP.
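For concreteness, mAP over multi-hot labels can be computed as the mean of per-class average precision. This is a standard sketch of the metric, not the paper's exact evaluation code.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one class: precision averaged at each positive, ranked by score."""
    order = np.argsort(-scores)          # rank clips by descending score
    hits = labels[order].astype(float)   # 1 where a positive is retrieved
    if hits.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())

def mean_average_precision(scores, labels):
    """scores, labels: (num_clips, num_classes); labels are multi-hot."""
    return float(np.mean([average_precision(scores[:, c], labels[:, c])
                          for c in range(scores.shape[1])]))
```

Because AP is computed per class and then averaged, rare classes such as Match Cut weigh as much as abundant ones such as Reaction Cut, which is what makes mAP informative on a long-tailed dataset.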
Implementation details. For all our experiments, we use ResNet-18 as the backbone for both the visual and audio streams. For the audio stream, we use a ResNet with 2D convolutions pre-trained on VGGSound; this backbone takes as input a spectrogram of the audio signal and processes it as a regular image. For the visual stream, we use a ResNet-(2+1)D pre-trained on Kinetics-400. We sample 16 frames from a Gaussian distribution centered around the cut as the input to the network. When using a single stream (audio only or visual only), we take the features after the average pooling layer and feed them to an MLP composed of a Fully-Connected (FC) layer, a ReLU, and a second FC layer whose output size is the number of classes (ten). When using two streams, we concatenate the features after the first FC layer of each MLP to obtain an audio-visual feature per clip, and pass it through a second FC layer to get the predictions. We train using SGD with momentum and weight decay, with a linear warm-up during the first epoch. We train for 8 epochs with an initial learning rate that decays by a factor of 10 after 4 epochs, using an effective batch size of 80 on 4 NVIDIA V100 GPUs.
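The schedule described above (linear warm-up during the first epoch, then a x10 decay after epoch 4 of the 8-epoch run) can be sketched as a function of the training step. `base_lr` is left as a free parameter since the exact initial learning rate is not stated here.

```python
def learning_rate(step, steps_per_epoch, base_lr,
                  warmup_epochs=1, decay_epoch=4, decay_factor=0.1):
    """Per-step learning rate: linear warm-up, then step decay."""
    epoch = step / steps_per_epoch
    if epoch < warmup_epochs:
        # linear ramp from ~0 up to base_lr over the first epoch
        return base_lr * (step + 1) / (warmup_epochs * steps_per_epoch)
    if epoch < decay_epoch:
        return base_lr
    return base_lr * decay_factor
```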
4.3 Results and Analysis
As described in Section 4.1, we benchmark our dataset using several combinations of modalities and strategies to combine them, along with a specialized loss function for multi-label, long-tailed datasets. Results are shown in Table 1.
| Method | mAP | Per-class AP (10 classes) |
| --- | --- | --- |
| AV + GB | 45.43 | 62.12, 63.46, 34.14, 30.39, 1.62, 20.64, 82.20, 40.89, 45.19, 73.64 |
| AV + GB + DB Loss | 45.72 | 63.91, 63.04, 33.24, 29.60, 1.84, 22.11, 81.20, 41.61, 45.91, 74.70 |
Linear Classifier vs. Fine-tuning: We evaluate the performance of frozen vs. fine-tuned features. As one might expect, fine-tuning the whole backbone shows consistent improvements across all classes regardless of the modality used; for instance, the audio-visual backbone's performance increases substantially. These results validate the value of the dataset for improving the audio and visual representations encoded by the backbones.
Modality Variants: Consistently across training strategies, we observe a common pattern: the visual modality performs better at the task than its audio counterpart. Nonetheless, combining both modalities still provides enhanced results for several classes and for the overall mAP. Cuts driven mainly by visual cues, such as Cutting on Action, Cut Away, and Cross Cut, do not improve when adding audio. However, the rest of the classes improve when using both modalities. In particular, L Cut, J Cut, and Speaker Change improve drastically, since these three types of cuts are naturally driven by audio-visual cues.
Gradient Blending: The second-to-last row of Table 1 shows the results of combining the three modalities with the GB weights $w_{a}$, $w_{v}$, and $w_{av}$. GB (45.43 mAP) performs better than combining the losses naively with equal weights. In fact, the GB weights outperform all the grid-searched loss-weight combinations we tested, so we keep the GB weights for the remaining experiments.
Distribution-Balanced Loss: The last row of Table 1 combines the Gradient-Blending weights with the DB Loss. Even though the best-represented classes do not improve, or even decrease, the overall mAP reaches 45.72; the improvement thus comes from the least represented classes. These results reflect the DB Loss's intended behavior, since it was explicitly designed to handle long-tailed distributions and balance the classes with the fewest samples. Overall, combining all the modalities with GB and the DB Loss gives the best results.
Window Sampling: We additionally explore how to pick the frames fed to the visual network. All previous experiments use Gaussian Sampling, which places a Gaussian window centered around the cut and adjusts its size to the length of the clip. We explore two other strategies: Uniform Sampling, which samples frames from a uniform distribution across the clip; and Fixed Sampling, which takes the 16 frames around the cut. We align the audio with the temporal span of the visual frames. Table 2 shows the results of the three strategies. The Gaussian strategy gives the best results (45.72 mAP), as it samples densely around the cut while including some context from the distribution's tails. These results suggest that the most critical information lies around the cut, but some context helps the model improve.
| Method | Sampling | mAP |
| --- | --- | --- |
| AV + GB + DB Loss | Uniform | 45.38 |
| AV + GB + DB Loss | Fixed | 45.55 |
| AV + GB + DB Loss | Gaussian | 45.72 |
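The three sampling strategies compared in Table 2 can be sketched as follows. The Gaussian standard deviation (`std_frac`) is an illustrative choice, as the exact window parameters are not specified here.

```python
import numpy as np

def sample_frame_indices(clip_len, cut_idx, n=16, strategy="gaussian",
                         std_frac=0.125, seed=0):
    """Pick `n` frame indices from a clip of `clip_len` frames whose cut
    sits at frame `cut_idx`."""
    rng = np.random.default_rng(seed)
    if strategy == "fixed":
        # the n frames immediately around the cut
        idx = np.arange(cut_idx - n // 2, cut_idx + n // 2)
    elif strategy == "uniform":
        # spread uniformly over the whole clip
        idx = rng.uniform(0, clip_len, size=n)
    else:
        # "gaussian": dense around the cut, tails provide context
        idx = rng.normal(cut_idx, clip_len * std_frac, size=n)
    idx = np.clip(np.round(idx).astype(int), 0, clip_len - 1)
    return np.sort(idx)
```

The Gaussian variant concentrates samples near the boundary (like Fixed) while still reaching the context frames that Uniform covers, which matches the intuition for why it performs best in Table 2.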
Test Set Results: After selecting the best-performing model on the validation set, we evaluate it on the test set. We obtain an mAP comparable to the validation-set results. For the full test set results, please refer to the supplementary material.
4.4 Performance Breakdown
Attributes and dataset characteristics. Figure 5(a) summarizes the performance of our best AV model (GB + DB Loss) for different attributes and dataset characteristics. In most cases, the model exhibits robust performance across attributes. The largest performance gap is observed between Speech and Other sounds. We associate this result with the fact that cuts with complex audio editing, those that include sound effects, often employ abstract editing such as Smash Cuts and Match Cuts, which are harder for the model to recognize. The model is also robust across dataset characteristics such as cut duration or the year the cut was produced. However, we notice that the model's performance drops significantly when there is a single label per cut. We observe that samples with multiple labels tend to belong to well-represented classes in the dataset, while single-label samples tend to belong to under-represented cut types, on which the model's performance is much lower. Thus, samples with multiple labels have higher accuracy than samples with single labels, as shown in Figure 5(a). We hypothesize that the DB Loss pushes the model to assign high scores to more than one cut type. These two findings suggest there is room to push performance by studying better audio backbones and by taking a deeper look at the multi-label nature of MovieCuts.
Audio-visual improvements per cut type. Figure 5(b) shows the relative improvement of the audio-visual model over the visual-only stream. The figure highlights whether each type of cut is driven by visual or audio-visual information. First, we observe that audio-visually driven cuts benefit the most from training a joint audio-visual model. Second, the largest gains among them are for cuts related to dialogue and conversations. For instance, L Cut's AP improves by 40%; this class typically involves an on-screen person talking in the first shot while only their voice is heard in the second shot. By encoding audio-visual information, the model disambiguates predictions that would otherwise be confusing to a visual-only model. Finally, all classes show a relative improvement over the visual baseline, suggesting that the Gradient Blending strategy lets the model optimize the modality weights and achieve, in the worst case, slight improvements over the visual baseline. In short, we empirically demonstrate the importance of modeling audio-visual information for recognizing cut types.
4.5 Qualitative Results
We showcase representative qualitative results for the Cutting on Action class in Figure 7. The first two cuts are correctly classified as cutting on action, since the cut happens right after the action is performed (a gunshot and a boxing punch). The third example is a false positive: the model wrongly predicts it as a Reaction Cut. The model fails gracefully; the shot focuses on the actor's face right before the action, which resembles what happens in Reaction Cuts. In the end, however, the actor is not reacting but performing an action across the cut. For more qualitative results, please refer to the supplementary material.
5 Conclusion
We introduced the task of cut type recognition and kick-started research in this new area by providing a new large-scale dataset accompanied by a benchmark of multiple audiovisual baselines. To construct the dataset, we collected more than 170K annotations from qualified human workers. We analyzed the dataset's diversity and uniqueness by studying its properties and audiovisual attributes. We proposed audiovisual baselines that include recent approaches addressing the multi-modal and imbalanced-learning nature of the problem. While we set a strong starting point, we hope that further research pushes the envelope of cut type recognition by leveraging MovieCuts.
Acknowledgments. This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding.
-  (1976) Grammar of the film language. Focal Press London. Cited by: Appendix B, §1, §3.1.
-  (2020) Condensed movies: story based retrieval with contextual embeddings. External Links: Cited by: §1, §2, §3.1.
-  (2016) Shot scale distribution in art films. Multimedia Tools and Applications 75 (23), pp. 16499–16527. Cited by: §2, §3.1.
-  (2013) Finding actors and actions in movies. In Proceedings of the IEEE international conference on computer vision, pp. 2280–2287. Cited by: §2.
-  (1993) Film art: an introduction. Vol. 7, McGraw-Hill New York. Cited by: §3.1.
-  (2019) Remembering winter was coming. Multimedia Tools and Applications 78 (24), pp. 35373–35399. Cited by: §2.
-  (2020) Serial speakers: a dataset of tv series. arXiv preprint arXiv:2002.06923. Cited by: §2.
-  (2020) Playing a part: speaker verification at the movies. arXiv preprint arXiv:2010.15716. Cited by: §1.
-  (2013) Classifying cinematographic shot types. Multimedia tools and applications 62 (1), pp. 51–73. Cited by: §1, §2, §3.1.
-  (2020) VGGSound: a large-scale audio-visual dataset. In International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Cited by: §3.3, §4.1, §4.2.
-  (2016) The evolution of pace in popular movies. Cognitive research: principles and implications 1 (1), pp. 30. Cited by: §1.
-  (2009) Automatic annotation of human actions in video. In 2009 IEEE 12th International Conference on Computer Vision, pp. 1491–1498. Cited by: §2.
-  (2006) Hello! my name is… buffy”–automatic naming of characters in tv video.. In BMVC, Vol. 2, pp. 6. Cited by: §1.
Ava: a video dataset of spatio-temporally localized atomic visual actions.
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056. Cited by: §1, §2.
Ridiculously fast shot boundary detection with fully convolutional neural networks. In 2018 International Conference on Content-Based Multimedia Indexing (CBMI), pp. 1–4. Cited by: §3.1.
-  (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.2.
-  (2018) Smart trailer: automatic generation of movie trailer using only subtitles. In 2018 First International Workshop on Deep and Representation Learning (IWDRL), pp. 26–30. Cited by: §2.
-  (2014) Thread-safe: towards recognizing human actions across shot boundaries. In Asian Conference on Computer Vision, pp. 222–237. Cited by: §1, §2.
-  (2018) Person search in videos with one portrait through visual and temporal links. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 425–441. Cited by: §2.
-  (2018-06) Unifying identification and context learning for person recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2020) MovieNet: a holistic dataset for movie understanding. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2.
-  (2018) From trailers to storylines: an efficient way to learn from movies. arXiv preprint arXiv:1806.05341. Cited by: §2.
Caption-supervised face recognition: training a state-of-the-art face model without manual annotation. In The European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2010) Automatic trailer generation. In Proceedings of the 18th ACM international conference on Multimedia, pp. 839–842. Cited by: §2.
-  (2005) The film encyclopedia. Collins. Cited by: §1.
-  (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §4.1, §4.2.
-  (2007) Anatomy of film. Kinema: A Journal for Film and Audiovisual Media. Cited by: §1.
-  (2008) Learning realistic human actions from movies. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §1, §2.
-  (2020) Multi-shot temporal event localization: a benchmark. arXiv preprint arXiv:2012.09434. Cited by: §1, §2.
-  (2017) A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6884–6893. Cited by: §2.
-  (2018) From Benedict Cumberbatch to Sherlock Holmes: character identification in TV series without a script. arXiv preprint arXiv:1801.10442. Cited by: §1.
-  (2020) Human mesh recovery from multiple shots. arXiv preprint arXiv:2012.09843. Cited by: §2.
-  (2021) Learning transferable visual models from natural language supervision. arXiv preprint arXiv:2103.00020. Cited by: Appendix B, Appendix B, §3.3, §3.3.
-  (2020) A unified framework for shot type classification based on subject centric lens. In The European Conference on Computer Vision (ECCV), Cited by: §1, §2.
-  (2020) A local-to-global approach to multi-modal movie scene segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10146–10155. Cited by: §2.
-  (2017) Movie description. International Journal of Computer Vision 123 (1), pp. 94–120. Cited by: §1, §2.
-  (2009) “Who are you?”-learning person specific classifiers from video. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1145–1152. Cited by: §2.
-  (2017) Harnessing AI for augmenting creativity: application to movie trailer creation. In Proceedings of the 25th ACM international conference on Multimedia, pp. 1799–1808. Cited by: §2.
-  (2008) Edit blindness: the relationship between attention and global change blindness in dynamic scenes. Journal of Eye Movement Research 2 (2). Cited by: Figure 1.
-  (2012) A window on reality: perceiving edited moving images. Current Directions in Psychological Science 21 (2), pp. 107–113. Cited by: Figure 1, §3.1.
-  (2012) The attentional theory of cinematic continuity. Projections 6 (1), pp. 1–27. Cited by: Appendix B.
-  (2016) MovieQA: Understanding Stories in Movies through Question-Answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §2.
-  (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §4.2.
-  Note: http://www.cuvideoedit.com/types-of-edits.php Cited by: §3.1.
-  Note: https://filmanalysis.yale.edu/editing/#transitions Cited by: §3.1.
-  (2009) Cinemetrics, part of the humanities’ cyberinfrastructure. Cited by: §1, §1.
-  (2018) MovieGraphs: towards understanding human-centric situations from videos. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §1, §2.
-  (2009) Taxonomy of directing semantics for film shot classification. IEEE Transactions on Circuits and Systems for Video Technology 19 (10), pp. 1529–1542. Cited by: §2, §3.1.
-  (2020) What makes training multi-modal classification networks hard?. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12695–12705. Cited by: Table 3, Table 4, Appendix C, §1, §4.1, §4.4, §4.4, Table 1, Table 2.
-  (2016) Analysing cinematography with embedded constrained patterns. In WICED-Eurographics Workshop on Intelligent Cinematography and Editing, Cited by: §2.
-  (2017) Analyzing elements of style in annotated film clips. In WICED 2017-Eurographics Workshop on Intelligent Cinematography and Editing, pp. 29–35. Cited by: §1, §2.
-  (2018) Thinking like a director: film editing patterns for virtual cinematographic storytelling. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14 (4), pp. 1–22. Cited by: §1, §2.
-  (2020) Distribution-balanced loss for multi-label classification in long-tailed datasets. In European Conference on Computer Vision, pp. 162–178. Cited by: Table 3, Table 4, Appendix C, §1, §4.1, §4.2, §4.4, Table 1, Table 2.
-  (2020) Online multi-modal person search in videos. In The European Conference on Computer Vision (ECCV), Cited by: §2.
-  (2019-10) A graph-based framework to bridge movies and synopses. In The IEEE International Conference on Computer Vision (ICCV), Cited by: §2.
-  (2015) Trailer generation via a point process-based visual attractiveness model. In Twenty-Fourth International Joint Conference on Artificial Intelligence, Cited by: §2.
Please visit: https://alejandropardo.net/publication/moviecuts/ for code and complete supplementary material.
Appendix A MovieCuts Annotation Consensus
To analyze the quality of MovieCuts’ annotations, we run an additional labeling campaign and measure the consensus between two disjoint sets of human annotators. Toward this goal, we sample clips at random from the initially annotated set. After receiving the annotations, we compute the number of responses in agreement with the first campaign. We notice that about of the time, both sets of workers agreed on the labels assigned to each candidate cut. While this can be seen as a high annotation-quality ratio, we believe there is room for improvement in reducing the rate of missing labels. A potential solution for future annotation campaigns is to increase the number of workers required to annotate each cut, which is currently set to three.
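As a concrete illustration of the consensus measurement described above, the sketch below computes the fraction of clips on which two annotation campaigns assigned exactly the same label set. The clip identifiers and labels are made-up examples, not MovieCuts data.

```python
# Hypothetical sketch: agreement between two disjoint annotation campaigns.
# Each campaign maps a clip id to the set of cut-type labels it assigned;
# a clip counts as "in agreement" when both label sets match exactly.

def agreement_ratio(campaign_a, campaign_b):
    """Fraction of clips annotated in both campaigns with identical labels."""
    shared = campaign_a.keys() & campaign_b.keys()
    agree = sum(1 for clip in shared if campaign_a[clip] == campaign_b[clip])
    return agree / len(shared)

a = {"clip1": {"J Cut"}, "clip2": {"Reaction Cut", "Speaker Change"}}
b = {"clip1": {"J Cut"}, "clip2": {"Reaction Cut"}}
print(agreement_ratio(a, b))  # 0.5
```

Requiring exact set equality is a strict criterion; a per-label (e.g., Jaccard) agreement would give partial credit when annotators miss a single label.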
Appendix B MovieCuts Attributes Details
We obtain attribute tags using CLIP’s zero-shot capabilities. To do so, we create language queries from the relevant classes of each attribute type, forming a set of candidate text-visual pairs, and use CLIP’s dual encoder to predict the most probable pair (the most probable tag). Thus, we compute an image embedding for the visual frames and a text embedding for all candidate text queries (attribute tags), and then compute the cosine similarity between the L2-normalized embedding pairs. Instead of simply passing the tags to the language encoder, we augment the text queries using the following templates: “a photo of a {subject attribute}” and “an {location attribute} photo” for the subject and location attributes, respectively. For each cut, we retrieve tags for each of its shots by sampling a random frame before and after the shot transition.
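The tag-retrieval step above reduces to a nearest-neighbor search in embedding space. The following minimal sketch shows that mechanic with random vectors standing in for CLIP’s image and text encoders; the tag strings are illustrative only.

```python
import numpy as np

# Illustrative sketch of CLIP-style zero-shot tagging: pick the tag whose
# text embedding is most cosine-similar to the image embedding.
# Random vectors stand in for CLIP's encoders here.

def most_probable_tag(image_emb, text_embs, tags):
    """Return the tag with highest cosine similarity to the image.

    image_emb: (d,) array; text_embs: (n_tags, d) array, one row per tag.
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarities after L2 normalization
    return tags[int(np.argmax(sims))]

tags = ["a photo of a man", "a photo of a woman", "an indoor photo"]
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(len(tags), 512))
print(most_probable_tag(image_emb, text_embs, tags))
```

In the real pipeline, `image_emb` and `text_embs` would come from CLIP’s image and text encoders, with the prompt templates above applied before encoding.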
Actions that trigger Cuts. Our goal is to find correlations between action tags and cut types. To do so, we first build a zero-shot action classifier based on CLIP. Since this zero-shot classifier does not reach high accuracy, we limit our analysis to the most confident tags only. These tags allow us to find the most common co-occurrences between actions and cut types. Figure 8 showcases three common action/cut pairs. These patterns recur across different movie scenes and editors’ styles. These empirical findings reaffirm the theory of film grammar [1, 41], which suggests that video editing follows a set of rules more often than not.
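A minimal sketch of the filtering-and-counting step described above, assuming hypothetical (action tag, confidence, cut type) triples: low-confidence tags are dropped, and the remaining pairs are counted to surface the most common co-occurrences.

```python
from collections import Counter

# Sketch (not the authors' exact procedure): keep only action tags whose
# zero-shot confidence clears a threshold, then count action/cut-type pairs.

def cooccurrences(samples, threshold=0.8):
    """samples: iterable of (action_tag, confidence, cut_type) triples."""
    counts = Counter(
        (action, cut)
        for action, conf, cut in samples
        if conf >= threshold
    )
    return counts.most_common()

samples = [
    ("talking", 0.92, "Speaker Change"),
    ("talking", 0.95, "Speaker Change"),
    ("running", 0.40, "Cut on Action"),  # dropped: low confidence
    ("fighting", 0.85, "Cut on Action"),
]
print(cooccurrences(samples))
# [(('talking', 'Speaker Change'), 2), (('fighting', 'Cut on Action'), 1)]
```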
Appendix C Additional Results
| Model | Sampling | mAP | Per-class AP (10 classes, same order as the tables in the main paper) |
|---|---|---|---|
| AV + GB + DB Loss | Uniform | 45.38 | 61.82, 61.38, 30.42, 29.91, 1.28, 21.10, 81.95, 42.97, 46.59, 76.39 |
| AV + GB + DB Loss | Fixed | 45.55 | 62.66, 60.25, 32.11, 28.73, 1.86, 21.82, 81.51, 42.39, 48.60, 75.51 |
| AV + GB + DB Loss | Gaussian | 45.72 | 63.91, 63.04, 33.24, 29.60, 1.84, 22.11, 81.20, 41.61, 45.91, 74.70 |
| AV + GB | Gaussian | 44.95 | 62.82, 61.80, 30.78, 29.72, 1.69, 23.12, 81.85, 39.90, 44.00, 73.81 |
| AV + GB + DB Loss | Gaussian | 45.32 | 63.90, 62.38, 30.96, 29.33, 2.15, 23.93, 81.04, 39.95, 45.14, 74.45 |
We showcase the PR curves for our best model on the validation set as an additional metric to the ones shown in the main manuscript. We report the precision and recall values at different confidence thresholds for each of the classes in MovieCuts.
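Each PR curve is obtained by sweeping a confidence threshold over the model’s per-class scores. A minimal sketch of that sweep (binary labels for a single cut-type class; scores and labels here are made up):

```python
import numpy as np

# Sketch: precision/recall points for one class, computed by sweeping a
# confidence threshold over the model's scores.

def pr_points(scores, labels, thresholds):
    """Return [(precision, recall)] for each threshold."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    points = []
    for t in thresholds:
        pred = scores >= t
        tp = np.sum(pred & (labels == 1))
        precision = tp / max(pred.sum(), 1)   # avoid division by zero
        recall = tp / max((labels == 1).sum(), 1)
        points.append((float(precision), float(recall)))
    return points

scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 0, 1, 0]
print(pr_points(scores, labels, [0.5]))  # [(0.5, 0.5)]
```

Libraries such as scikit-learn (`precision_recall_curve`) compute the same points at every distinct score threshold.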
Additionally, in table 4 we present the most important baselines reported on the test set. We notice that the same analysis made on the validation set applies to the test set. The visual signal is stronger than the audio signal; however, when combining them, some classes improve drastically. Besides, Gradient Blending improves over the naïve combination of modalities, and the DB Loss improves the results even further. The best model on both the validation and test sets is obtained by combining all of these techniques.
Finally, the attached slides contain several sample cuts per category and some qualitative results.
Appendix D Additional Statistics
Figure 10 summarizes the difference in label distributions across genres. To build this visualization, we first calculate the average number of cuts for each class; we then plot, for each genre, the deviation from the class mean in units of standard deviation. Thus, we visualize how frequent or infrequent each class is depending on the movie genre. For instance, we observe that for genres like Romance and Drama, the classes Speaker Change, J-cut, and L-cut are more frequent than in Action and Adventure movies. Conversely, for Action and Adventure, Cross-cuts and Cuts on Action are more frequent than in Romance and Drama.
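The per-genre deviation described above can be sketched as follows; the genre names match the text, but the counts are invented for illustration and the class list is truncated to two classes.

```python
import numpy as np

# Sketch of the genre visualization: express each genre's per-class cut
# counts as deviations from the class-wise mean, in units of standard
# deviation (counts below are made up).

counts = {  # genre -> cuts per class [Speaker Change, Cut on Action]
    "Romance": np.array([120.0, 30.0]),
    "Action":  np.array([60.0, 90.0]),
}
per_class = np.stack(list(counts.values()))  # genres x classes
mean = per_class.mean(axis=0)                # class-wise mean across genres
std = per_class.std(axis=0)
deviation = {g: (v - mean) / std for g, v in counts.items()}
print(deviation["Romance"])  # Speaker Change above the mean, Cut on Action below
```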
In addition to figure 10, in Figure 11 we show the distribution of classes for the most represented genres across the different splits: train 11(a), validation 11(b), and testing 11(c). We observe that the class distributions are consistent across splits, i.e., the splits are approximately identically distributed.