Imagine you are watching the movie Mission Impossible starring Tom Cruise: In a fight scene, Ethan leaps onto a helicopter's landing skid and attaches exploding gum to the windshield to destroy the enemy. Suddenly, the story jumps to an emotional scene where Ethan pulls the trigger and sacrifices his life to save his wife Julia. Such dramatic changes of scene play an important role in a movie's storytelling. Generally speaking, a movie is composed of a well-designed series of intriguing scenes with transitions, where the underlying storyline determines the order in which the scenes are presented. Therefore, recognizing movie scenes, including the detection of scene boundaries and the understanding of scene content, facilitates a wide range of movie understanding tasks such as scene classification, cross-movie scene retrieval, human interaction graph construction, and human-centric storyline construction.
It is worth noting that scenes and shots are essentially different. In general, a shot is captured by a camera that operates for an uninterrupted period of time and thus is visually continuous, while a scene is a semantic unit at a higher level. As illustrated in Figure 1, a scene comprises a sequence of shots that present a semantically coherent part of the story. Therefore, whereas a movie can be readily divided into shots based on simple visual cues using existing tools, the task of identifying the sub-sequences of shots that constitute scenes is challenging, as it requires semantic understanding in order to discover the associations between shots that are semantically consistent but visually dissimilar.
There have been extensive studies on video understanding. Despite the great progress in this area, most existing works focus on recognizing the categories of certain activities in short videos [28, 6, 14]. More importantly, these works assume a list of pre-defined categories that are visually distinguishable. For movie scene segmentation, however, no such list of categories is possible. Additionally, shots are grouped into scenes according to their semantic coherence rather than just visual cues. Hence, a new method needs to be developed for this purpose.
To associate visually dissimilar shots, we need semantic understanding. The key question here is "how can we learn semantics without category labels?" Our idea for tackling this problem consists of three aspects: 1) Instead of attempting to categorize the content, we focus on scene boundaries. We can learn what constitutes a boundary between scenes in a supervised way, and thus acquire the ability to differentiate between within-scene and cross-scene transitions. 2) We leverage the cues contained in multiple semantic elements, including place, cast, action, and audio, to identify the associations across shots. By integrating these aspects, we can move beyond visual observations and establish semantic connections more effectively. 3) We also explore top-down guidance from the overall understanding of the movie, which brings further performance gains.
Based on these ideas, we develop a local-to-global framework that performs scene segmentation through three stages: 1) extracting shot representations from multiple aspects, 2) making local predictions based on the integrated information, and finally 3) optimizing the grouping of shots by solving a global optimization problem. To facilitate this research, we construct MovieScenes, a large-scale dataset in which annotated scenes are derived by grouping shots across a large collection of movies.
Experiments show that our method raises the average precision from 28.1 to 47.1 over the existing best method (Table 3). Existing methods pretrained on our dataset also gain substantially in performance.
2 Related Work
Scene boundary detection and segmentation. The earliest works exploit a variety of unsupervised methods: clustering shots according to shot color similarity; plotting a shot response curve from low-level visual features and setting a threshold to cut scenes; and predicting scene boundaries with dynamic programming by optimizing a predefined objective [4, 10, 24]. Researchers have also resorted to other modalities, e.g., leveraging scripts with an HMM, or using low-level visual and audio features to build a scene transition graph. These unsupervised methods are not flexible and rely heavily on manually set parameters for different videos.
Researchers then moved on to supervised approaches and started to build new datasets. IBM OVSD consists of 21 short videos with rough scenes, each of which may contain more than one plot. BBC Planet Earth comes from 11 episodes of BBC documentaries. Another line of work generates synthetic data from Places205. However, the videos in these datasets lack rich plots or storylines, which limits their real-world applicability. The number of test videos is too small to reflect the effectiveness of the methods, given the vast variety of scenes. Additionally, these methods take the shot as the analytical unit and perform scene segmentation recursively in local regions. Because they do not consider the semantics within a scene, it is hard for them to learn high-level semantics and achieve an ideal result.
Scene understanding in images and short videos.
Image-based scene analysis [31, 29, 9] can infer some basic knowledge about scenes, e.g., what is contained in an image. However, it is hard to tell the action from a single static image, since it lacks contextual information. Dynamic scene understanding has been further studied with seconds-long short videos [6, 14]. However, these videos typically contain a single shot and lack the variations in time and place found in long videos.
Scene understanding in long videos.
Existing efforts on long videos largely focus on localizing and classifying actions. MovieGraphs focuses on individual scene clips in a movie and the language structure of a scene. Some transition parts between scenes are discarded, making the information incomplete.
In order to achieve more general scene analysis that extends to videos of long duration, we address scene segmentation in movies with our large-scale MovieScenes dataset. We propose a framework that considers both the relationships among shots locally and the relationships among scenes globally using multiple semantic elements, achieving much better segmentation results.
3 MovieScenes Dataset
To facilitate scene understanding in movies, we construct MovieScenes, a large-scale scene segmentation dataset in which scenes are derived by grouping shots from a large collection of movies. This dataset provides a foundation for studying the complex semantics within scenes, and facilitates plot-based long video understanding built on top of scenes.
3.1 Definition of Scenes
Following previous definitions of a scene [17, 4, 10, 24], a scene is a plot-based semantic unit in which a certain activity takes place among a certain group of characters. While a scene often happens in a fixed place, it is also possible that a scene traverses multiple places continually, e.g., during a fighting scene in a movie, the characters may move from indoors to outdoors. These complex entanglements make accurate scene detection more difficult, as it requires high-level semantic information. Figure 2 illustrates some examples of annotated scenes in MovieScenes, demonstrating this difficulty.
The vast diversity of movie scenes makes it hard for annotators to agree with one another. To ensure the consistency of results from different annotators, we provided a list of ambiguous examples with specific guidance clarifying how such cases should be handled during the annotation procedure. Moreover, all data were independently annotated multiple times by different annotators. In the end, this multi-pass annotation with the provided guidance leads to highly consistent results, i.e., 89.5% of all cases received consistent annotations, as shown in Table 1.
| | Consistent | Inconsistent | Uncertain |
|---|---|---|---|
| Transit. | 16,392 (76.5%) | 5,036 (23.5%) | - |
| Non-trans. | 225,836 (92.6%) | 18,048 (7.4%) | - |
| Total | 242,052 (89.5%) | 23,260 (8.6%) | 5,138 (1.9%) |
3.2 Annotation Tool and Procedure
Our dataset contains a large number of movies, and it would be a prohibitive amount of work if annotators went through them frame by frame. We adopt a shot-based approach, based on the understanding that a shot (an unbroken sequence of frames recorded from the same camera) can always be uniquely assigned to one scene. Consequently, the scene boundaries must be a subset of the shot boundaries. For each movie, we first divide it into shots using off-the-shelf methods. This shot-based approach greatly simplifies the scene segmentation task and speeds up the annotation process. We also developed a web-based annotation tool (a figure of the UI is shown in the supplementary materials) to facilitate annotation. All annotators went through a two-round annotation procedure to ensure high consistency. In the first round, we dispatched each chunk of movies to three independent annotators for a later consistency check. In the second round, inconsistent annotations were re-assigned to two additional annotators for extra evaluations.
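Because scene boundaries must be a subset of shot boundaries, shot-level scene annotations translate directly into per-boundary labels for supervised training. The helper below is a hypothetical illustration of that bookkeeping (the function name and span format are not part of the actual annotation tool):

```python
def boundary_labels(num_shots, scene_spans):
    """Derive binary labels for the num_shots - 1 shot boundaries.

    scene_spans: list of (first_shot, last_shot) index pairs, contiguous
    and covering shots 0 .. num_shots - 1.
    Label i (the boundary between shot i and shot i + 1) is 1 exactly
    when some annotated scene ends at shot i.
    """
    scene_ends = {last for _, last in scene_spans}
    return [1 if i in scene_ends else 0 for i in range(num_shots - 1)]
```

For example, six shots annotated as three scenes `[(0, 1), (2, 4), (5, 5)]` yield boundary labels `[0, 1, 0, 0, 1]`: the final scene end (shot 5) is the end of the movie, not a transition between shots.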
3.3 Annotation Statistics
Large-scale. Table 2 compares MovieScenes with existing similar video scene datasets. MovieScenes is significantly larger than the other datasets in terms of the number of shots/scenes and the total time duration. Furthermore, compared with datasets built from short films or documentaries, ours covers a much wider range of sources, capturing many kinds of scenes.
Diversity. Most movies in our dataset are feature-length, providing rich information about individual movie stories. A wide range of genres is covered, including the most popular ones such as dramas, thrillers, and action movies, making our dataset comprehensive and general. The length of the annotated scenes also varies over a wide range. This large variability at both the movie level and the scene level makes the movie scene segmentation task more challenging. (More statistical results are given in the supplements.)
4 Local-to-Global Scene Segmentation
As mentioned above, a scene is a series of continuous shots. Therefore, scene segmentation can be formulated as a binary classification problem, i.e., determining whether a shot boundary is a scene boundary. However, this task is not easy, since segmenting scenes requires recognizing multiple semantic aspects and using complex temporal information.
To tackle this problem, we propose a Local-to-Global Scene Segmentation framework (LGSS). The overall formulation is shown in Equation 1. A movie with n shots is represented as a shot sequence [s_1, ..., s_n], where each shot s_i is represented with multiple semantic aspects. We design a three-level model to incorporate different levels of contextual information, i.e., the clip level, the segment level, and the movie level, on top of the shot representations. Our model outputs a sequence of predictions [o_1, ..., o_{n-1}], where o_i in {0, 1} denotes whether the boundary between the i-th and (i+1)-th shots is a scene boundary.
In the following parts of this section, we first introduce how to represent a shot s_i with multiple semantic elements. Then we illustrate the details of the three levels of our model, i.e., the clip-level, segment-level, and movie-level components. The overall framework is shown in Figure 3.
4.1 Shot Representation with Semantic Elements
A scene is a sequence of shots sharing some common elements, e.g., place, cast, etc. Thus, it is important to take these related semantic elements into consideration for a better shot representation. In our LGSS framework, a shot is represented with four elements that play important roles in the constitution of a scene, namely place, cast, action, and audio.
To obtain semantic features for each shot, we utilize: 1) a ResNet50 pretrained on the Places dataset, applied to key-frame images, to get place features; 2) a Faster R-CNN pretrained on the CIM dataset to detect cast instances and a ResNet50 pretrained on the PIPA dataset to extract cast features; 3) a TSN pretrained on the AVA dataset to get action features; and 4) NaverNet pretrained on the AVA-ActiveSpeaker dataset to separate speech from background sound, with STFT features computed for each at a 16 kHz sampling rate and a 512-sample window, which are concatenated to obtain the audio features.
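As a minimal sketch of how the four per-shot features combine into one shot representation (the real features come from the pretrained networks listed above; the helper names and plain-list vectors here are illustrative stand-ins), concatenation keeps each semantic element addressable downstream, and cosine similarity is the natural way to compare shots per element:

```python
import math

def shot_representation(place, cast, action, audio):
    # The shot vector is the concatenation of the four element
    # embeddings; downstream models can still slice out each element.
    return list(place) + list(cast) + list(action) + list(audio)

def cosine(u, v):
    # Cosine similarity, used to compare two shots within one element.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```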
4.2 Shot Boundary Representation at Clip Level
As mentioned before, scene segmentation can be formulated as a binary classification problem on shot boundaries. Therefore, how to represent a shot boundary becomes a crucial question. Here, we propose a Boundary Network (BNet) to model the shot boundary. As shown in Equation 2, BNet takes a clip of the movie, i.e., a window of shots around the boundary, as input and outputs a boundary representation b_i. Motivated by the intuition that a boundary representation should capture both the differences and the relations between the shots before and after the boundary, BNet consists of two branches, B_d and B_r. B_d is modeled by two temporal convolution layers that embed the shots before and after the boundary respectively, followed by an inner product operation to compute their difference. B_r aims to capture the relations among the shots; it is implemented by a temporal convolution layer followed by a max pooling.
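The two-branch structure can be illustrated with a toy, non-learned stand-in. In the actual BNet both branches use temporal convolutions with learned weights, so the helper below only mirrors the structure: a difference term computed from the pooled halves on either side of the boundary, and a relation term pooled over the whole window.

```python
def bnet_sketch(window):
    """Toy boundary representation for a window of 2*w shots
    (w before and w after the candidate boundary). Each shot is a
    feature vector (list of floats). Returns (difference, relation)."""
    half = len(window) // 2
    before, after = window[:half], window[half:]
    dim = len(window[0])
    mean = lambda shots, j: sum(s[j] for s in shots) / len(shots)
    # Stand-in for B_d: inner product between the mean-pooled
    # "before" and "after" halves of the window.
    difference = sum(mean(before, j) * mean(after, j) for j in range(dim))
    # Stand-in for B_r: element-wise max pooling over the whole window.
    relation = [max(s[j] for s in window) for j in range(dim)]
    return difference, relation
```

With orthogonal halves (e.g., two "place A" shots followed by two "place B" shots) the difference term drops to zero, signaling a likely boundary, while the relation term still summarizes the window content.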
4.3 Coarse Prediction at Segment Level
After we obtain the representation b_i of each shot boundary, the problem becomes predicting a sequence of binary labels [o_1, ..., o_{n-1}] from the sequence of representations [b_1, ..., b_{n-1}], which could in principle be solved by a sequence-to-sequence model. However, the number of shots n in a movie is usually too large for existing sequential models to cover with such a long memory. Therefore, we design a segment-level model to make coarse predictions over a movie segment consisting of w_t shots (w_t << n). Specifically, we use a sequential model, e.g., a Bi-LSTM, applied with a stride of w_t/2 shots, to predict a sequence of coarse scores [p_1, ..., p_{n-1}], as shown in Equation 3, where p_i is the probability that the i-th shot boundary is a scene boundary. By binarizing p_i with a threshold τ, we obtain a coarse prediction indicating whether the i-th shot boundary is a scene boundary.
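The segment-level logic amounts to two simple pieces: splitting the long boundary sequence into overlapping windows that a sequence model can handle, and binarizing its per-boundary scores. A sketch under our setup (a window of 10 boundaries with half-window stride; the scoring model itself is abstracted away and the helper names are hypothetical):

```python
def segment_windows(num_boundaries, window=10, stride=5):
    """Split the boundary sequence into overlapping (start, end)
    segments so a sequence model (e.g. a Bi-LSTM) never has to hold
    the whole movie in memory at once."""
    starts = range(0, max(num_boundaries - window, 0) + 1, stride)
    return [(s, min(s + window, num_boundaries)) for s in starts]

def binarize(scores, threshold=0.5):
    """Turn per-boundary probabilities p_i into coarse labels."""
    return [1 if p > threshold else 0 for p in scores]
```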
4.4 Global Optimal Grouping at Movie Level
The segmentation result obtained by the segment-level model is not good enough, since it only considers local information over w_t shots while ignoring the global contextual information of the whole movie. In order to capture the global structure, we develop a global optimal model that takes movie-level context into consideration. It takes the shot representations and the coarse predictions as inputs and makes the final decision. The global optimal model is formulated as an optimization problem. Before introducing it, we first establish the concept of super shots and the objective function.
The local segmentation gives us an initial rough scene cut set, from which we obtain super shots, where a super shot is a sequence of consecutive shots determined by the segment-level results. Our goal is to merge these super shots into scenes. Since the target scene number k is not given, to decide it automatically we look through all possible scene counts; with k fixed, we find the optimal scene cut set. The overall problem is thus to maximize the total scene cut score jointly over k and the cut set.
Here, the optimal scene cut score achieved by a scene formulates the relationship between a super shot and the remaining super shots. It comprises two terms that capture a global relationship and a local relationship: a similarity score between super shots, and an indicator function of whether a super shot has a very high similarity with any other super shot in the same scene, which aims to model shot threads within a scene.
Solving the optimization problem and determining the target scene number k can be done efficiently with dynamic programming (DP). Denoting by F(k, P_{1:n}) the optimal score for grouping the first n super shots into k scenes, the update is

F(k, P_{1:n}) = max over 1 <= i < n of [ F(k-1, P_{1:i}) + G(P_{i+1}, ..., P_n) ],

where P_{1:i} is the set containing the first i super shots and G scores the scene formed by the remaining super shots.
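The DP can be sketched with a simplified objective that keeps only a within-scene cohesion term (the full objective also includes the cross-scene thread indicator, which is omitted here). The function below, a hypothetical stand-in for the movie-level grouping, finds the optimal partition of m consecutive super shots into k scenes under this reduced objective:

```python
def best_grouping(sim, k):
    """Partition m consecutive super shots into k scenes, maximizing
    within-scene cohesion. sim[i] is the similarity between super shot
    i and i + 1 (length m - 1); a scene's score is the sum of sim over
    the adjacent pairs it contains. Returns (best score, cut positions),
    where a cut after super shot t means a scene boundary there."""
    m = len(sim) + 1
    # prefix[i] = cohesion of grouping super shots 0..i into one scene
    prefix = [0.0] * m
    for i in range(1, m):
        prefix[i] = prefix[i - 1] + sim[i - 1]
    seg = lambda a, b: prefix[b] - prefix[a]  # cohesion of scene a..b
    NEG = float("-inf")
    # F[j][i]: best score grouping super shots 0..i into j scenes
    F = [[NEG] * m for _ in range(k + 1)]
    cut = [[None] * m for _ in range(k + 1)]
    for i in range(m):
        F[1][i] = seg(0, i)
    for j in range(2, k + 1):
        for i in range(j - 1, m):
            for t in range(j - 2, i):
                cand = F[j - 1][t] + seg(t + 1, i)
                if cand > F[j][i]:
                    F[j][i], cut[j][i] = cand, t
    # Recover the cut positions by backtracking.
    cuts, j, i = [], k, m - 1
    while j > 1:
        t = cut[j][i]
        cuts.append(t)
        i, j = t, j - 1
    return F[k][m - 1], sorted(cuts)
```

For instance, with adjacent similarities [0.9, 0.1, 0.8] over four super shots, cutting at the low-similarity boundary (after super shot 1) is optimal for k = 2, exactly as intuition suggests.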
The above DP gives us a scene cut result, but we can further take this result as a new super shot set and iteratively merge it to improve the final result. When the super shots update, we also need to update their representations. A simple summation over all the contained shots may not be an ideal representation for a super shot, as some shots carry less information. Therefore, it is better to refine the representations of super shots during optimal grouping. The details of this refinement are given in the supplements.
5 Experiments

5.1 Experimental Setup
Data. We implement all the baseline methods on our MovieScenes dataset. The whole annotation set is split into Train, Val, and Test sets at the video level with the ratio 10:2:3.
Implementation details. We use a cross-entropy loss for the binary classification. Since the dataset is unbalanced, i.e., non-scene-transition shot boundaries dominate (approximately 9:1), we apply 1:9 weights in the cross-entropy loss for non-scene-transition and scene-transition shot boundaries respectively. We train the models for 30 epochs with the Adam optimizer. The initial learning rate is 0.01 and is divided by 10 at the 15th epoch.
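The class-weighted loss for a single boundary can be written out explicitly; `weighted_bce` below is an illustrative stand-in for the framework's weighted cross-entropy, with the 9x up-weighting of the rare positive (scene-transition) class balancing the roughly 9:1 negative-to-positive ratio:

```python
import math

def weighted_bce(p, y, w_pos=9.0, w_neg=1.0):
    """Class-weighted binary cross-entropy for one shot boundary.
    p: predicted probability of a scene transition;
    y: 1 for a scene-transition boundary, 0 otherwise."""
    eps = 1e-7  # clamp to avoid log(0)
    p = min(max(p, eps), 1.0 - eps)
    return -(w_pos * y * math.log(p) + w_neg * (1 - y) * math.log(1 - p))
```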
In the global optimal grouping, we select the initial super shots from the local segmentation according to the classification scores obtained for the shot boundaries. The range of candidate target scene numbers is estimated from the MovieScenes statistics.
Evaluation Metrics. We adopt three commonly used metrics: 1) Average Precision (AP); in our experiments we report the mean AP over all movies. 2) M: a weighted sum of the intersection-over-union of each detected scene with respect to its closest ground-truth scene. 3) Recall@3s: recall at 3 seconds, i.e., the percentage of annotated scene boundaries that lie within 3 seconds of a predicted boundary.
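Recall@3s is straightforward to compute from boundary timestamps; the helper below is a sketch of the metric as described (the function name is hypothetical):

```python
def recall_at_3s(gt_times, pred_times, tol=3.0):
    """Fraction of ground-truth scene boundaries (in seconds) that lie
    within tol seconds of some predicted boundary."""
    if not gt_times:
        return 0.0
    hit = sum(1 for g in gt_times
              if any(abs(g - p) <= tol for p in pred_times))
    return hit / len(gt_times)
```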
| Method | AP (%) | M (%) | Recall (%) | Recall@3s (%) |
|---|---|---|---|---|
| Rasheed et al., GraphCut | 14.1 | 29.7 | 53.7 | 57.2 |
| Chasanis et al., SCSA | 14.7 | 30.5 | 54.9 | 58.0 |
| Han et al., DP | 15.5 | 32.0 | 55.6 | 58.4 |
| Rotman et al., Grouping | 17.6 | 33.1 | 56.6 | 58.7 |
| Tapaswi et al., StoryGraph | 25.1 | 35.7 | 58.4 | 59.7 |
| Baraldi et al., Siamese | 28.1 | 36.0 | 60.1 | 61.2 |
| LGSS (Multi-Semantics+BNet+Local Seq) | 44.9 | 46.5 | 71.4 | 77.5 |
| LGSS (all, Multi-Semantics+BNet+Local Seq+Global) | 47.1 | 48.8 | 73.6 | 79.8 |
5.2 Quantitative Results
The overall results are shown in Table 3. We reproduce existing methods [18, 4, 10, 21, 24, 1] with deep place features for fair comparison. The base model applies temporal convolution on shots with the place feature, and we gradually add the following four modules to it, i.e., 1) multiple semantic elements (Multi-Semantics), 2) shot boundary representation at clip level (BNet), 3) coarse prediction at segment level with a local sequence model (Local Seq), and 4) global optimal grouping at movie level (Global).
Analysis of overall results. The performance of a random method depends on the ratio of scene-transition to non-scene-transition shot boundaries in the test set, which is approximately 1:9. All the conventional methods [18, 4, 10, 21] outperform random guessing, yet do not achieve good performance, since they only consider local contextual information and fail to capture semantic information. [24, 1] achieve better results than the conventional methods [18, 4, 10, 21] by considering information over a larger range.
Analysis of our framework. Our base model applies temporal convolution on shots with the place feature alone. Adding multiple semantic elements (Multi-Semantics) clearly improves AP over the base model. Modeling shot boundaries with BNet (Multi-Semantics+BNet) raises performance further, suggesting that in the scene segmentation task, modeling shot boundaries directly is useful. Adding the local sequence model (Multi-Semantics+BNet+Local Seq) brings another improvement, reaching 44.9 AP. The full model, which includes both the local sequence model and global optimal grouping (Multi-Semantics+BNet+Local Seq+Global), improves the result from 44.9 to 47.1, showing that movie-level optimization is important for scene segmentation.
In all, with the help of multiple semantic elements, clip-level shot boundary modeling, the segment-level local sequence model, and movie-level global optimal grouping, our best model outperforms both the base model and the previous best method by a large margin: it improves over Siamese from 28.1 to 47.1 AP, an absolute gain of 19.0. These results verify the effectiveness of the local-to-global framework.
5.3 Ablation Studies
Multiple semantic elements.
We take the pipeline with shot boundary modeling (BNet), the local sequence model, and global optimal grouping as the base model. As shown in Table 4, gradually adding mid-level semantic elements improves the final results. Starting from the model using place only, adding audio, action, and cast each brings an improvement, and combining all of them together gives the largest gain. This result indicates that place, cast, action, and audio all provide useful information for scene segmentation.
Additionally, with the help of our multi-semantic elements, other methods [21, 24, 1] also achieve relative improvements. This result further supports our assumption that multi-semantic elements contribute to scene segmentation.
Influence of temporal length. We choose different window sizes in the shot boundary modeling at clip level (BNet) and different sequence lengths of Bi-LSTM at segment level (Local Seq). The result is shown in Table 5. The experiments show that a longer range of information improves the performance. Interestingly, the best results come from 4 shots for a shot boundary modeling and 10 shot boundaries as the input of a local sequence model, which involves 14 shot information in total. This is approximately the length of a scene. It shows that this range of temporal information is helpful to scene segmentation.
(Table 6: results for different iteration numbers (Iter #) with initial super shot numbers (Init #) of 400, 600, 800, and 1000.)
Choice of hyper-parameters in global optimal grouping. We vary the iteration number of the optimization (Iter #) and the initial super shot number (Init #) and show the results in Table 6.
Looking at each row, i.e., changing the initial super shot number, the setting whose initial number is closest to the target scene number while still ensuring a sufficiently large search space achieves the best results. Looking at each column, we observe that the same setting also converges fastest, reaching its best results after only a few iterations, and all settings converge within the tested iteration budget.
5.4 Qualitative Results
Qualitative results showing the effectiveness of our multi-modal approach are illustrated in Figure 4, and qualitative results of the global optimal grouping are shown in Figure 5. (More results are shown in the supplementary materials.)
Multiple semantic elements.
To quantify the importance of multiple semantic elements, we take the norm of the cosine similarity for each modality. Figure 4 (a) shows an example where the cast is very similar across consecutive shots and helps establish the scene. In Figure 4 (b), the characters and their actions are hard to recognize: the first shot is a long shot where the character is very small, and the last shot only shows part of the character without a clear face. In such cases, the scene is recognized thanks to the similar audio feature shared among these shots. Figure 4 (c) is a typical "phone call" scene where the action in each shot is similar. In Figure 4 (d), only the place is similar, yet we can still conclude it is one scene. From these observations and our analysis of more such cases, we draw an empirical conclusion: the multi-modal cues are complementary to one another and together help scene segmentation.
Optimal grouping. We show two cases to demonstrate the effectiveness of optimal grouping. There are two scenes in Figure 5. Without global optimal grouping, a scene with a sudden viewpoint change is likely to trigger a spurious scene transition (red line in the figure): in the first case, the coarse prediction produces two scene cuts when the shot type changes from a full shot to a close shot; in the second case, it produces a scene cut when an extreme close-up shot appears. Our global optimal grouping smooths out these redundant scene cuts, as expected.
5.5 Cross Dataset Transfer
We test two methods, DP and Siamese, on the existing datasets OVSD and BBC with pretraining on our MovieScenes dataset; the results are shown in Table 7. With pretraining on our dataset, both methods achieve significant absolute and relative improvements in AP. The reason is that our dataset covers many more scenes and thus gives models pretrained on it better generalization ability.
6 Conclusion

In this work, we collect a large-scale annotation set for scene segmentation in movies. We propose a local-to-global scene segmentation framework that exploits hierarchical temporal and semantic information. Experiments show that this framework is very effective and achieves much better performance than existing methods. Successful scene segmentation can support a range of movie understanding applications. (More details are shown in the supplementary materials.) All the studies in this paper together show that scene analysis is a challenging but meaningful topic that deserves further research effort.
Acknowledgment This work is partially supported by the General Research Fund (GRF) of Hong Kong (No. 14203518 & No. 14205719) and SenseTime Collaborative Grant on Large-scale Multi-modality Analysis.
- (2015) A deep siamese network for scene detection in broadcast videos. In 23rd ACM International Conference on Multimedia, pp. 1199–1202.
- Finding actors and actions in movies. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2280–2287.
- (2018) PySceneDetect: intelligent scene cut detection and video splitting tool. https://pyscenedetect.readthedocs.io/en/latest/
- (2008) Scene detection in videos using shot clustering and sequence alignment. IEEE Transactions on Multimedia 11 (1), pp. 89–100.
- (2019) Naver at ActivityNet Challenge 2019 – Task B: active speaker detection (AVA). arXiv preprint arXiv:1906.10555.
- ActivityNet: a large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 961–970.
- (2005) Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18 (5-6), pp. 602–610.
- (2018) AVA: a video dataset of spatio-temporally localized atomic visual actions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6047–6056.
- (2015) Visual semantic role labeling. arXiv preprint arXiv:1505.04474.
- (2011) Video scene segmentation using a novel boundary evaluation criterion and dynamic programming. In 2011 IEEE International Conference on Multimedia and Expo, pp. 1–6.
- (2016) Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
- (2018) Unifying identification and context learning for person recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2217–2225.
- (2009) A novel role-based movie scene segmentation method. In Pacific-Rim Conference on Multimedia, pp. 917–922.
- (2019) Moments in Time dataset: one million videos for event understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Using deep features for video scene detection and annotation. Signal, Image and Video Processing, pp. 1–9.
- (2014) Linking people in videos with "their" names using coreference resolution. In European Conference on Computer Vision, pp. 95–110.
- (2003) Scene detection in Hollywood movies and TV shows. In 2003 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. II–343.
- (2005) Detection and representation of scenes in videos. IEEE Transactions on Multimedia 7 (6), pp. 1097–1105.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pp. 91–99.
- (2019) AVA-ActiveSpeaker: an audio-visual dataset for active speaker detection. arXiv preprint arXiv:1901.01342.
- (2017) Optimal sequential grouping for robust video scene detection using multiple modalities. International Journal of Semantic Computing 11 (02), pp. 193–208.
- (1998) Exploring video structure beyond the shots. In Proceedings of the IEEE International Conference on Multimedia Computing and Systems, pp. 237–240.
- (2011) Temporal video segmentation to scenes using high-level audiovisual features. IEEE Transactions on Circuits and Systems for Video Technology 21 (8), pp. 1163–1177.
- (2014) StoryGraphs: visualizing character interactions as a timeline. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 827–834.
- (1999) Fitting the mel scale. In 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 1, pp. 217–220.
- (2018) MovieGraphs: towards understanding human-centric situations from videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8581–8590.
- (2016) Temporal segment networks: towards good practices for deep action recognition. In European Conference on Computer Vision, pp. 20–36.
- (2016) Situation recognition: visual semantic role labeling for image understanding. In Conference on Computer Vision and Pattern Recognition.
- (2015) Beyond frontal faces: improving person recognition using multiple cues. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4804–4813.
- (2018) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (6), pp. 1452–1464.