
From Trailers to Storylines: An Efficient Way to Learn from Movies

The millions of movies produced in human history are a valuable resource for computer vision research. However, learning a vision model from movie data meets serious difficulties. A major obstacle is the computational cost: the length of a movie is often over one hour, substantially longer than the short video clips that previous studies mostly focus on. In this paper, we explore an alternative approach to learning vision models from movies. Specifically, we consider a framework comprising a visual module and a temporal analysis module. Unlike conventional learning methods, the proposed approach learns these modules from different sets of data: the former from trailers and the latter from movies. This allows distinctive visual features to be learned within a reasonable budget while still preserving long-term temporal structures across an entire movie. We construct a large-scale dataset for this study and define a series of tasks on top of it. Experiments on this dataset show that the proposed method substantially reduces the training time while obtaining highly effective features and coherent temporal structures.





1 Introduction

“Promise me you’ll survive … That you won’t give up, no matter what happens. No matter how hopeless.”

This heartbreaking moment in Titanic, the epic romance-disaster film directed by James Cameron, has deeply moved everyone in front of the screen. Movies contain tremendous value: they are not only a form of entertainment, but also a rich medium that reflects our culture, society, and history. From the standpoint of computer vision research, they also constitute a valuable source of data for visual learning, from which we can learn how various phenomena, situations, and even feelings can be presented in a visual way.

The value of movies has been noticed by the computer science community for a long time. Over the past decades, numerous studies have analyzed movies from different perspectives. [1, 22, 10] study movies through their characters, by either building a role-net of the characters or learning their hidden personas. [24] tries to align movies and books, a step towards story-like visual explanations. [16] sets up a Q&A benchmark that consists of a large number of questions with high semantic diversity and provides different sources of information about a movie, such as plots, to learn visual question answering models. Yet, an important question regarding movie data has rarely been explored: can we learn a vision model for movie understanding?

In this study, we aim to explore an effective approach to movie understanding, one that goes from low-level feature representation to high-level semantic analysis. Towards this goal, we face two significant challenges, namely the prohibitive costs of computation and annotation. First, current video analysis research [19, 15, 17, 20, 21, 2] focuses on short video clips, i.e., those lasting for seconds or at most several minutes. In contrast, movies usually last substantially longer, often one hour or more. For videos of this scale, even simple processing, e.g., extracting CNN features, may take an unusually long time, let alone training a model thereon. On the other hand, vision models, e.g., convolutional networks, require a large number of annotated samples to train. Obtaining annotated training samples, even for images, is widely known to be a costly procedure. It goes without saying that it would be a formidable task to annotate movies, which contain much more complicated structures.

Our approach to tackling these difficulties is inspired by an important fact: movies often come with trailers. Trailers are short previews of movies, which often contain the most significant shots selected by professionals. Therefore, from a diverse set of trailers, one can see a wide range of representative shots and thus learn useful visual cues for movie analysis. Also, trailers are much shorter than movies, often lasting for less than five minutes. Hence, learning models from trailers should be affordable if done efficiently. Whereas a trailer preserves significant visual features of a movie, it loses a key aspect: the temporal structure. In particular, a movie presents a story in a natural way following a logical storyline, while a trailer is just a compilation of distinctive shots, which can be far from each other in the original timeline. Hence, we cannot expect to learn temporal structures from trailers.

With the complementary natures of movies and trailers in mind, we explore an alternative approach to efficiently learning from movies. Specifically, the proposed framework integrates two key modules: a visual analysis module learned from trailers, and a temporal analysis module learned from movies but built on top of the features extracted by the visual module. A key observation behind this design is that state-of-the-art models for visual feature extraction, e.g., convolutional networks, are typically much heavier than temporal models, e.g., Markov chains or recurrent networks [3]. Hence, by learning these components from different sources of data, we can keep the computational cost at an affordable level while allowing the framework to capture long-term temporal structures across a movie. Furthermore, to train these models, we introduce two strategies: (1) mining meta-data, e.g., the information about movie genres and plot keywords, to supervise the learning of the visual module; and (2) learning the temporal module in a self-supervised manner, that is, omitting parts of the chain and letting the model predict them given the rest.

To support this study and facilitate future research along this direction, we construct LSMTD, a large-scale movie and trailer dataset, which contains hundreds of full movies and tens of thousands of trailers. The total length of these videos is more than 2,000 hours. Thereon we define a series of tasks to evaluate the capability of movie analysis methods, e.g., to choose the next shot from a set of candidates given a sequence of preceding shots. Our experiments on this dataset show that the proposed approach substantially reduces the training time, while still outperforming models trained in conventional ways on various tasks.

To sum up, the contributions of this work mainly lie in two aspects: (1) We propose an alternative way to learn models for movie understanding, where the visual module and the temporal analysis module are respectively trained on trailers and movies, using meta-data and self-supervised learning. (2) We construct a large-scale movie and trailer dataset, define a series of tasks to assess the capability in movie understanding, and thereon perform a systematic study to compare different training strategies.

2 Related Work

Studies on Movies.

Due to their rich content and meta information, movies have been a gold mine for AI research. Numerous attempts have been made to automatically understand movies from different aspects. A stream of work focuses on the movie characters [1, 22, 10], trying to understand the relationships among them based on textual materials. Zhu et al. [24] proposed a method to align movies and books, in order to learn high-level semantics therefrom. Tapaswi et al. [16] developed a Q&A benchmark, which suggests an alternative way to learn from movies, that is, via visual question answering. These works mostly rely on text-based information, e.g., plots, subtitles, and scripts, to mine and learn the semantics. When they incorporate visual observations, they simply use features extracted by a convolutional network trained on image-based tasks (rather than on the movies). This is largely due to the computational difficulty of training on the movies themselves.

Studies on Trailers.

There have also been studies done on trailers. These studies often focus on other problems, e.g., face recognition [9]. There are also efforts to generate trailers for user-uploaded videos by learning from the structures of movies [5, 8]. In [14, 23], the genre classification problem is tackled using trailers. For this purpose, datasets with several thousand trailers have been constructed. These works are all based on the trailers themselves, without considering their corresponding movies. It is noteworthy that movies and trailers have rarely been considered together in vision research. To the best of our knowledge, this work is the first practical approach that bridges trailers and movies and allows knowledge learned from trailers to be transferred to full movie analysis.

Video-based Recognition.

Video-based recognition is also an active area in computer vision. Related topics include action recognition and video summarization. Over the past several years, the action recognition task has witnessed remarkable progress [19, 15, 17, 20, 21, 2], thanks to the advances in deep learning. For this task, following earlier efforts presented in [15] and [20], recent studies have gradually shifted from hand-crafted features [19] to representations learned with deep convolutional networks. A key issue in video analysis is the excessive computational cost. Recently, Temporal Segment Networks (TSN) [21] were proposed to tackle this problem with sparsely sampled frames. Another important family of work is video summarization, which aims to find the most representative frames or snippets of a video. This problem is usually solved using unsupervised or weakly supervised learning [11]. A comprehensive review of this topic is provided in [18]. Whereas the proposed model provides the functionality of identifying representative frames as a byproduct, our ultimate goal is different: to learn powerful visual representations and the temporal structures of movies, instead of just choosing representative frames.

3 Large-Scale Movie and Trailer Dataset

Figure 2: Samples from LSMTD. The black words are the movie titles, the red ones their genres, and the blue ones plot keywords.
Name           MGCD [23]   LMTD [14]   LSMTD
trailers       1239        3500        34,219
movies         -           -           508
genres         4           22          22
plot keywords  -           -           33
duration (h)   42          118         2,200
Table 1: Basic statistics of several trailer datasets.

This work aims to learn models from both movies and trailers. For this purpose, existing datasets [23, 14] are limited: they contain several thousand trailers along with some meta information, e.g., genres, but no full movies. To facilitate our study, we constructed a new dataset, named Large-Scale Movie and Trailer Dataset (LSMTD). This dataset contains 34,219 movie trailers and 508 full movies, which together comprise more than 2,200 hours of video material. To the best of our knowledge, this is the largest dataset ever built for visual analysis of movies, and one of the largest among all video datasets.

We purchased the movie DVDs from Amazon and other commercial channels, downloaded the trailers from YouTube, and collected their genres from IMDB and user-provided plot keywords from TMDB. Every trailer and movie in LSMTD is associated with at least one genre. Compared to genres, e.g., "Action" and "Drama", plot keywords, e.g., "love", "escape", and "alien", are often more specific in describing the movie content. The plot keywords are very sparse: there are a large number of distinct keywords, most of which appear only once or twice. To obtain a meaningful set of keywords, we filter out infrequent keywords and merge synonyms and closely related concepts, such as "blood" and "gore", which results in a diverse set of 33 unique keywords.
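The keyword-cleaning step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the synonym table, the cutoff `min_count`, and all names here are hypothetical.

```python
from collections import Counter

def clean_keywords(movie_keywords, synonyms, min_count=3):
    """Sketch of the keyword cleaning in Sec. 3: map synonyms and closely
    related concepts onto one canonical keyword (e.g. "gore" -> "blood"),
    then drop keywords that appear fewer than min_count times overall."""
    # 1) merge synonyms per movie
    merged = [[synonyms.get(k, k) for k in kws] for kws in movie_keywords]
    # 2) count occurrences across all movies and filter infrequent ones
    counts = Counter(k for kws in merged for k in kws)
    return [[k for k in kws if counts[k] >= min_count] for kws in merged]

syn = {"gore": "blood"}  # illustrative synonym table
data = [["love", "gore"], ["blood", "love"], ["alien", "blood"], ["love"]]
print(clean_keywords(data, syn, min_count=3))
# → [['love', 'blood'], ['blood', 'love'], ['blood'], ['love']]
```

After merging, "love" and "blood" each occur three times and survive the cutoff, while the singleton "alien" is removed.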

Among all trailers and movies, a large portion are associated with at least one keyword in this set. Table 1 shows statistics of LSMTD in comparison with other datasets. Figure 2 shows several examples of the trailers together with their genres and plot keywords.

We plan to incorporate extra meta-information into the dataset and will release it to promote further study of movie analysis. Due to legal constraints, trailers and movies will be released in the form of URLs to the sources.

4 Movie Analysis Framework

We aim to learn both visual information and temporal structures from the movies. Moreover, we want to learn them in a very efficient way. This means we cannot expect to take all frames of an entire movie in one step of learning, which is both prohibitively expensive (due to the sheer volume of data contained in a movie) and unnecessary (frames in a movie are highly redundant). Instead, we take shots as the units, motivated by the observation that frames within a shot are highly similar.

Based on shots, the proposed framework decomposes the learning task into two parts: learning visual representations from trailers and learning temporal structures from movies. The rationale behind this design consists of two aspects. On one hand, trailers are much shorter than the movies while usually containing their most significant shots. This condensed collection of distinctive shots forms a good basis from which to learn reasonable representations. On the other hand, a trailer is usually formed by shots sparsely selected from a movie. Therefore, it does not preserve the temporal structure of the original storyline, which is also an important aspect of movie understanding. To capture the temporal structures, we formulate an LSTM-based model on top of the visual features, and learn it in a self-supervised way. In what follows, we describe these two components in turn.

4.1 Learning Visual Representations from Trailers

Figure 3: Learning Visual Representations from Trailers.
Figure 4: Learning Temporal Structures from Movies.

We start with learning a visual model from trailers. Specifically, the model is expected to output visual features to represent shots of the movies. Therefore it can be seen as a shot encoder. Following previous work in video analysis [21], we formulate the model as shown in Figure 3, which encodes a shot in two steps: deriving frame-wise features via a convolutional network, and then combining them into a shot-based representation. The key question here is how to learn this model effectively and efficiently.

Shot-based Representations.

While trailers are generally much shorter than a full movie, e.g., a few minutes vs. over an hour, they are still substantially longer than the video clips used in much of the previous work on video analysis, which usually last only seconds. Therefore, even on trailers, frame-by-frame analysis is still excessively expensive. However, unlike ordinary videos such as those from surveillance systems, a trailer usually comprises a sequence of distinctive shots, each lasting for a few seconds. The frames in a shot often look quite similar. This special temporal structure suggests an efficient way to analyze trailers: consider a trailer as a sequence of shots and sample frames sparsely in each shot.

Specifically, given a trailer, we first use an external tool [13] to partition it into a sequence of coherent shots $(s_1, \ldots, s_M)$. From each shot $s_i$, we sample $K$ frames $x_{i,1}, \ldots, x_{i,K}$ and use a feature extractor $f(\cdot; \theta)$ to derive features for the sampled frames, where $\theta$ denotes the parameters of the feature extractor. Then, we derive a shot-based representation $\mathbf{v}_i$ by average pooling:

$$\mathbf{v}_i = \frac{1}{K} \sum_{k=1}^{K} f(x_{i,k}; \theta).$$
Considering the strong expressive power of convolutional neural networks (CNNs) in visual representation, we suggest using a CNN for feature extraction. In particular, we use BN-Inception [4] in our experiments, which we empirically found to be a good balance between performance and cost. Also, we set the number of sampled frames per shot, $K$, to a small value greater than one, which allows a more robust representation than a single frame. We found that increasing $K$ further is unnecessary, due to the high redundancy among frames.
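The shot-encoding step above can be sketched in a few lines. This is an illustrative sketch, not the paper's implementation: the toy feature extractor stands in for the BN-Inception CNN, and the value of `k` is an assumption.

```python
import numpy as np

def shot_representation(frames, feature_extractor, k=3, seed=0):
    """Encode one shot by sparsely sampling k frames and average-pooling
    their frame-wise features (Sec. 4.1). k is a small constant; the
    paper only states that a few frames per shot suffice."""
    rng = np.random.default_rng(seed)
    idx = sorted(rng.choice(len(frames), size=min(k, len(frames)), replace=False))
    feats = np.stack([feature_extractor(frames[i]) for i in idx])
    return feats.mean(axis=0)  # shot-based representation v_i

# toy stand-in for a CNN: mean color per channel
toy_cnn = lambda frame: frame.reshape(-1, 3).mean(axis=0)
shot = [np.random.rand(8, 8, 3) for _ in range(40)]  # one 40-frame shot
v = shot_representation(shot, toy_cnn, k=3)
print(v.shape)  # → (3,)
```

Because frames within a shot are highly similar, pooling over a handful of sampled frames approximates pooling over all of them at a fraction of the cost.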

Training Visual Models.

We rely on associated tags, i.e., genres and plot keywords, to supervise the learning of the visual model. These tags are attached to the entire movie/trailer instead of individual shots. Hence, to learn from such tags, we first need to combine the shot-based representations to generate video-level predictions.

It is noteworthy that movie producers usually choose the most representative shots for a trailer. Therefore, the shots in the trailer are probably relevant to the genres/keywords. This observation suggests a simple scheme to learn the model. Specifically, we can just sample $n$ shots from a trailer $T$ and aggregate their features via another level of average pooling:

$$\mathbf{u} = \frac{1}{n} \sum_{l=1}^{n} \mathbf{v}_{j_l}.$$

Here, $j_1, \ldots, j_n$ denote the indexes of the sampled shots, and we choose $n$ small enough to fit the training process in GPU memory. Then, via a fully connected layer and a softmax layer, we can turn $\mathbf{u}$ into categorical predictions over the tags, and consequently the model can be trained based on the ground-truth tags.

Discussion: Here, we do not assume that each of the sampled shots is pertinent to the given tags. As long as a tag is relevant to a few of the samples, it will get a high prediction score, which is very likely given the nature of trailers. In a certain sense, the process of choosing shots to form a trailer, which is done by the movie producer, can be considered a weak form of supervision, and we are leveraging it for free. In practice, we found that the scheme described above, while simple, is very effective.
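The video-level forward pass described above can be sketched as follows. This is a minimal sketch under stated assumptions: the weights, the number of sampled shots `n`, and the feature dimension are all illustrative, and a plain matrix product stands in for the fully connected layer.

```python
import numpy as np

def video_tag_scores(shot_feats, W, b, n=8, seed=0):
    """Video-level tag prediction (Sec. 4.1): average-pool n sampled
    shot features, then a fully connected layer and a softmax."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(shot_feats), size=min(n, len(shot_feats)), replace=False)
    u = shot_feats[idx].mean(axis=0)     # trailer-level feature u
    logits = W @ u + b                   # fully connected layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                   # probabilities over the tags

feats = np.random.rand(50, 16)               # 50 shots, 16-d features
W, b = np.random.rand(22, 16), np.zeros(22)  # 22 genre tags (cf. Table 1)
p = video_tag_scores(feats, W, b)
print(p.shape)  # → (22,)
```

During training, the cross-entropy loss against the ground-truth tags would be backpropagated through this pooled prediction.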

Train Data     Score Average                        Feature + LSTM
               Genres           Keywords            Genres           Keywords
               recall@3  MAP    recall@3  MAP       recall@3  MAP    recall@3  MAP
Image-base     -         -      -         -         0.477     0.472  0.199     0.127
Movie 361      0.433     0.342  0.192     0.103     0.432     0.440  0.181     0.107
Trailer 361    0.421     0.430  0.154     0.126     0.435     0.446  0.196     0.123
Trailer 2K     0.559     0.538  0.222     0.128     0.491     0.513  0.217     0.128
Trailer 10K    0.586     0.587  0.245     0.131     0.531     0.523  0.228     0.113
Trailer 33K    0.582     0.596  0.248     0.139     0.528     0.538  0.236     0.139
Table 2: Tag classification results.
Name          Duration (h)   Shots (k)
Movie 361     770            657
Trailer 361   13             21
Trailer 2K    70             117
Trailer 10K   370            587
Trailer 33K   1121           1761
Table 3: The duration and number of shots of the different training sets.

4.2 Learning Temporal Structures from Movies

In addition to the visual representations, temporal structure is also a very important aspect of a movie. As discussed before, this structure is mostly lost in the trailers. Therefore, to learn this structure, we still need to rely on the movies themselves. Fortunately, we do not need to start from scratch – the visual models learned from trailers have already provided a powerful encoder for visual information. Hence, we can model the temporal structures on top of these visual representations.

Our temporal structure model is based on Long Short-Term Memory (LSTM) [3], which has been shown to be a very effective model of sequential structures. In particular, it is able to preserve long-range dependencies while being sensitive to short-term changes, and is thus well suited for capturing the temporal structures of movies. This LSTM formulation takes a sequence of movie shots encoded by our visual model as input, while trying to capture the semantics via its latent states.

To learn the LSTM model, we propose a self-supervised learning scheme, motivated by a question: how can we know whether a model really captures the sequential structure? A natural way is to ask the model to predict what happens next conditioned on the preceding observations. In our context, this can be realized by requesting the model to predict the $(t+1)$-th shot given the preceding $t$ shots. We call this task next shot prediction. However, synthesizing the frames of the next shot is a very difficult problem and is not directly relevant to our goal, as we are interested in the high-level semantics. We circumvent this issue by reformulating next shot prediction as a multi-choice Q&A problem. The task now becomes "given a sequence of $t$ shots, choose the best $(t+1)$-th shot from a pool of candidates". The pool of candidates comprises both the correct answer and a series of distracting options. The distracting options can be chosen from either the same movie or other movies, which influences the difficulty of the task: the former is obviously more difficult, as options from the same movie tend to be more deceiving.
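Constructing one such self-supervised question can be sketched as below. This is an illustrative sketch: the context length `t`, the number of distractors, and the sampling details are assumptions, not values from the paper.

```python
import random

def make_prediction_question(shots, other_shots, t=8, n_distract=4,
                             cross_movie=True, seed=0):
    """Build one next-shot question: the first t shots are the condition,
    shot t+1 is the correct answer, and distractors come from other
    movies ("Cross Movie") or from later in the same movie ("In Movie")."""
    rng = random.Random(seed)
    context, answer = shots[:t], shots[t]
    pool = other_shots if cross_movie else shots[t + 1:]
    distractors = rng.sample(pool, n_distract)
    candidates = [answer] + distractors
    rng.shuffle(candidates)
    return context, candidates, candidates.index(answer)

movie = [f"shot_{i}" for i in range(30)]       # one movie's shot sequence
others = [f"other_{i}" for i in range(100)]    # shots from other movies
ctx, cands, label = make_prediction_question(movie, others)
print(len(ctx), len(cands), cands[label])  # → 8 5 shot_8
```

No annotation is needed: the movie's own shot order supplies the supervisory signal.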

Generally, a multi-choice Q&A problem can be formulated as a three-way scoring function [16], which we denote by $S(q, a, c)$, where $q$ denotes the question, $a$ denotes a candidate answer, and $c$ is the condition, which includes other observations besides the questions and the answers. Then, question answering can be performed by finding the answer with the maximum score among all provided choices:

$$a^* = \operatorname*{argmax}_{a \in \mathcal{A}} S(q, a, c).$$

For our specific problem of shot prediction, the condition $c$ is the sequence of preceding shots and $a$ is one of the candidate shots.

We formulate the scoring function as an LSTM constructed on top of discriminative CNNs, as shown in Figure 4. The overall pipeline can be briefly described as follows. On top of the shot-based features derived from our visual model, we construct an LSTM, where each time step corresponds to a shot in the movie sequence. In particular, at each step, an LSTM unit takes both the preceding states and the visual feature of the current shot as input, and yields a $d$-dimensional feature $\mathbf{h}_t$ as output, which encodes the model's understanding of all the shots it has seen. On the other hand, each candidate shot is characterized by its own visual feature of dimension $d'$.

We replicate $\mathbf{h}_t$ $N$ times, where $N$ is the number of candidates, and concatenate the copies with the candidate shot representations, thus forming a matrix of size $N \times (d + d')$, where each row corresponds to a candidate combined with the condition. This combined representation, through a series of convolutions, is distilled into a score vector of length $N$. Finally, via a softmax layer, these scores are converted into normalized probability values, one for each choice. We can then learn the network by maximizing the log-probabilities of the correct answers.
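The replicate-concatenate-score pipeline above can be sketched as follows. This is a simplified stand-in, not the paper's network: a tiny two-layer transform replaces the convolution stack, and all weights and dimensions are illustrative.

```python
import numpy as np

def score_candidates(h, candidates, W1, W2):
    """Next-shot scoring sketch (Sec. 4.2): replicate the LSTM state h
    for each of the N candidates, concatenate with candidate features,
    distill each row to a scalar score, and normalize with a softmax."""
    n = len(candidates)
    rows = np.concatenate([np.tile(h, (n, 1)), candidates], axis=1)  # N x (d+d')
    hidden = np.maximum(rows @ W1, 0)   # small transform standing in for the convs
    scores = (hidden @ W2).ravel()      # one score per candidate
    e = np.exp(scores - scores.max())
    return e / e.sum()                  # probability per choice

d, dp, N = 8, 8, 5
h = np.random.rand(d)                   # condition: LSTM state after t shots
cands = np.random.rand(N, dp)           # candidate shot features
W1 = np.random.rand(d + dp, 4)
W2 = np.random.rand(4, 1)
p = score_candidates(h, cands, W1, W2)
print(p.shape)  # → (5,)
```

Training would maximize the log-probability assigned to the row holding the true next shot.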

Discussion: This framework has two noteworthy aspects. First, the LSTM is built on the shot-based representations learned from the trailers. These features can be extracted at the beginning and cached, and we do not update the underlying CNN during training. Therefore, the model can be trained very efficiently even on entire movies. Second, via the next-shot prediction problem, the training can be self-supervised, without the need of any external annotations to provide supervisory signals.

5 Experimental Results

The primary goal of this work is to develop a method that can learn visual representations and temporal structures from both trailers and movies. To test the capability of the learned models, we set up two benchmark tasks, namely tag classification and shot prediction, and additionally test our framework on the MovieQA benchmark [16], in order to evaluate the models from multiple perspectives.

We propose to learn the visual model from trailers instead of directly from the movies. To validate the effectiveness of this approach, we compare different configurations, where the models are respectively learned from movies and from different subsets of the trailers. All these models adopt the same BN-Inception architecture [4].

In our experiments, the movies in LSMTD are randomly divided into three sets, for training (named "Movie 361", as it contains 361 movies), validation, and testing. As for the trailers, we first filter out those whose movies are in the validation or testing set. The remaining ones form the set "Trailer 33K", which contains around 33,000 trailers. From this full training set of trailers, we randomly sample 2,000 and 10,000 trailers to construct two subsets, "Trailer 2K" and "Trailer 10K". We also construct a special subset, "Trailer 361", which contains the trailers corresponding to the 361 training movies. In total, we get five different sets of data for training, as summarized in Tab. 3. These subsets are constructed to investigate the effect of the number of trailers on the quality of visual models, and to compare the effectiveness of trailer-based and movie-based training.

5.1 Tag Classification

The first task is tag classification. For each movie, we ask the models to predict both its genres and plot keywords. We construct our visual models as described in Sec. 4.1 and train them with both genres and plot keywords under a multi-task setting. We train the visual models on the training sets described above, obtaining five different models. During training, we randomly sample a number of shots from a video (movie or trailer) and extract a few frames from each shot. The output is obtained by averaging the responses of all frames. During testing, we try two different ways to predict the tags: (1) Score Average: average the predictions from all sampled shots to get the final prediction. (2) Feature + LSTM: construct an LSTM on top of the shot-based representations, and train the LSTM to encode the shot sequence. The outputs of all time steps of the LSTM are averaged to obtain the final prediction. For the Feature + LSTM approach, in addition to the five models trained on the video subsets, we also consider a setting that uses features extracted directly from a CNN trained on ImageNet [12] (without finetuning on LSMTD). This setting is referred to as "Image-base".

Tab. 2 summarizes the results. We can see that the visual model trained on "Trailer 33K" performs the best, on both genre and plot keyword predictions. This demonstrates that as the number of trailers increases, the quality of the learned visual model also improves. Interestingly, models learned on "Movie 361" get comparable or slightly worse results than those on "Trailer 361". When compared with the models trained on more trailers (whose overall cost is still not as high), the performance of "Movie 361" falls far behind. This observation clearly suggests that, in terms of learning visual representations, training on trailers, which often contain the most distinctive shots, is more effective than training directly on movies.

5.2 Movie Q&A

Figure 5: The overall structure of our Q&A model. First, the features of the video clips and the embedding of the question are concatenated. We then replicate the concatenated vector once per answer candidate and combine the copies with the embeddings of the choices, which results in a matrix with one row per candidate. We feed this matrix to a CNN with three convolutional layers followed by a softmax layer, and finally obtain a probability vector over the given choices.

We also tested the visual model on the MovieQA benchmark dataset [16], to evaluate how relevant the learned visual representations are to semantic understanding. The MovieQA dataset consists of questions about several hundred movies, each with one correct answer among the provided choices. It provides multiple sources of information, such as movie clips, plots, and subtitles, but not all the movies come with video clips. As our goal here is to evaluate the visual representations, we train and test on the subset of questions accompanied by video clips, which is divided into training, validation, and testing portions.

This Q&A problem takes the same form as Eq. (3), where $q$ and $a$ are the embeddings of the question and the answers, and the condition $c$ is the embedding of the video clip aligned with the question. We use word2vec [7] to embed $q$ and $a$, and use the feature extracted by our visual model as the embedding of the video clips. The overall pipeline of our Q&A model is shown in Figure 5.

Tab. 4 shows the testing results of "Movie 361" and "Trailer 33K", comparing them with SSCB, the baseline provided in [16]. It can be seen that the features from the model learned on trailers outperform those from the model learned on movies. This again corroborates our hypothesis: training on trailers is effective for learning visual representations for movies.

Model Accuracy
SSCB [16] 22.18
Movie 361 23.45
Trailer 33K 24.32
Table 4: Movie QA Results.

5.3 Shot Prediction

Model         In Movie             Cross Movie
              Average    LSTM      Average    LSTM
Movie 361     0.185      0.364     0.713      0.795
Trailer 361   0.181      0.362     0.637      0.792
Trailer 2K    0.172      0.367     0.618      0.796
Trailer 10K   0.161      0.399     0.551      0.817
Trailer 33K   0.154      0.401     0.529      0.825
Table 5: Shot prediction accuracy.
Figure 6: Shot retrieval. The left part shows the shots with the highest responses to "Romance" (first row) and "War" (second row) in Pearl Harbor. The right part shows the shots with the highest responses to "Sport" (first row) and "Music" (second row) in High School Musical.
Figure 7: Responses to the plot keywords in Titanic. The three plots show the responses from the visual model trained on "Trailer 33K" followed by an LSTM, the visual model trained on "Trailer 33K" alone, and the visual model trained on "Movie 361". From the comparison of these plots, we can see that learning visual information from trailers works better than learning from movies, and that the temporal structure can be reconstructed by a self-supervised LSTM.

Based on LSMTD, we define a shot prediction task to evaluate the capability of the temporal structure model in analyzing movies. Given a sequence of $t$ shots, we expect the temporal model to predict the $(t+1)$-th shot. As described in Sec. 4.2, we design this task as a multi-choice Q&A problem: "which is the $(t+1)$-th shot, given a sequence of $t$ shots?". The distracting choices are shots randomly picked from either the same movie or other movies.

Based on this task, we construct a benchmark for shot prediction upon the movies of LSMTD. This benchmark has two settings: (1) "In Movie", where the distracting answers are other shots within the same movie, and (2) "Cross Movie", where the distracting answers are sampled from all movies. Each question comes with one correct answer and several deceiving ones. The questions are divided into three disjoint sets, drawn respectively from the training, validation, and testing movies.

We compare our temporal model described in Sec. 4.2 with a straightforward baseline. This baseline simply averages the features of the given shots and computes the cosine distances between the averaged feature vector and those of the candidate shots. The results are summarized in Tab. 5. We can see that our temporal model significantly outperforms the baseline, which shows that our model does, to a certain extent, capture the temporal structures, giving it an improved ability to predict next shots. We also notice higher accuracies in the "Cross Movie" setting than in the "In Movie" setting. This is because shots from the same movie are usually more confusing than those from other movies, as they share similar visual styles, movie characters, etc. Again, we observe that the visual models learned from trailers perform better than the models trained on movies, and that the performance increases as the number of training trailers grows. These observations are in accordance with our earlier findings.
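The averaging baseline described above is simple enough to sketch directly; this illustration uses synthetic feature vectors rather than real shot features.

```python
import numpy as np

def cosine_baseline(context_feats, candidate_feats):
    """Baseline from Sec. 5.3: average the features of the given shots,
    then pick the candidate with the highest cosine similarity to the
    averaged vector."""
    q = context_feats.mean(axis=0)
    q = q / np.linalg.norm(q)
    c = candidate_feats / np.linalg.norm(candidate_feats, axis=1, keepdims=True)
    return int(np.argmax(c @ q))  # index of the chosen candidate

ctx = np.random.rand(10, 16)  # features of the preceding shots
# synthetic candidates: the first one equals the context mean, so it
# has cosine similarity 1 and must be selected
cands = np.vstack([ctx.mean(axis=0), np.random.rand(4, 16)])
print(cosine_baseline(ctx, cands))  # → 0
```

Because this baseline ignores shot order entirely, the gap between it and the LSTM in Tab. 5 isolates the contribution of the learned temporal structure.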

5.4 Visualization

Finally, we evaluate whether the visual models capture the connections between semantic concepts and visual observations via a qualitative study of shot retrieval. We select several movies and retrieve the shots with the highest responses to given query tags. Figure 6 shows some representative results. The shots with high responses to the query tags are indeed highly relevant, indicating that the visual models have effectively learned the semantics behind the visual observations.
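The retrieval step itself is simple once the visual model has produced per-tag responses for every shot. The sketch below assumes (hypothetically) that these responses are stored as a `(num_shots, num_tags)` score matrix; it merely ranks shots by their response to one tag.

```python
import numpy as np

def retrieve_top_shots(shot_scores, tag_index, k=5):
    """Return the indices of the k shots with the highest response
    to a given tag.

    shot_scores: (num_shots, num_tags) array of per-tag responses
                 produced by the visual model for one movie.
    tag_index:   column index of the query tag.
    """
    responses = shot_scores[:, tag_index]
    # Sort in descending order of response and keep the top k.
    return np.argsort(-responses)[:k].tolist()
```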

For comparison, Figure 7 shows the responses of the shots in Titanic to two plot keywords, “love” and “dystopia”, based on three models respectively. The model trained on “Trailer 33K” produces accurate responses, while the one trained on “Movie 361” fails to capture these semantic concepts, producing poorly aligned responses. More results are provided in the supplemental material.

6 Conclusions

This paper presented an efficient approach to learning visual models from movies. In particular, it learns visual representations from trailers, taking advantage of their distinctive nature, and learns temporal structures from full movies via a self-supervised formulation. We collected a large-scale dataset of trailers and movies, and defined two tasks thereon, namely tagging and shot prediction, to evaluate a model’s capability of understanding a given movie purely from visual observations. We also tested our framework on the movie Q&A task. Experimental results on all three tasks consistently demonstrate the effectiveness of the proposed framework.
