Comprehensive Video Understanding: Video summarization with content-based video recommender design

10/30/2019
by   Yudong Jiang, et al.

Video summarization aims to extract keyframes/shots from a long video. Previous methods mainly take the diversity and representativeness of generated summaries as prior knowledge in algorithm design. In this paper, we formulate video summarization as a content-based recommender problem: the system should distill the most useful content from a long video for users who suffer from information overload. A scalable deep neural network is proposed to predict whether a video segment is useful for users by explicitly modelling both the segment and the whole video. Moreover, we accomplish scene and action recognition in untrimmed videos in order to find more correlations among different aspects of video understanding tasks. We also discuss the effect of audio and visual features in the summarization task, and we extend our work with data augmentation and multi-task learning to prevent the model from early-stage overfitting. Our final model won first place in the ICCV 2019 CoView Workshop Challenge Track.


1 Introduction

Figure 1: The whole picture of our video summarization system. The feature extractor can be a model trained on different video understanding tasks. The summarization system uses viewers' feedback and features from different sources to generate the video summary. Users' feedback is treated as a supervised signal but is not required in the prediction process.

As information overload becomes an increasingly serious problem in modern society, many efficient tools have been designed to overcome "information anxiety". Video, the fastest-growing information carrier, was projected to account for more than 80% of all Internet traffic by 2020 [3]. Video summarization aims to address this problem by extracting the keyframes/shots from a long video that contain the most useful content for people. It serves as a vital way of comprehensively understanding video data in research while saving viewers' time on information acquisition. This technology has gained the attention of both academia and industry: Adobe has already included a video summarization feature in its video editing products, and some cloud computing companies, such as Microsoft Azure and AliCloud, provide this function as an online service.

One of the main challenges in video summarization is subjectivity: different people may select different key shots for the same video. We investigate two of the most referenced datasets, SumMe [6] and TVSum [15], both of which provide manually curated consistency benchmarks, with human-consistency F1 scores of 0.31 for SumMe and 0.36 for TVSum. Based on these observations, we do not take diversity, representativeness, or other subjective attributes into consideration; instead, we focus solely on the key shots and the full video. The mission of our algorithm is to find the most attractive video segments for a group of annotators/users when they watch a long video.

Content-based recommender modelling is one of the most important technologies in recommendation systems, especially for alleviating the cold-start problem; it recommends items whose content is similar to what users like and is usually formalized as a similarity learning problem [13]. A content-based video recommender deep model learns a compact representation of videos and builds a bridge between users' feedback and video semantic information. We develop and deploy HighlightNet to learn annotators' preferences; the model is supervised by the segments' importance feedback. This framework can easily combine different inputs, such as raw features, high-level features, audio features and visual features. The holistic picture of our work is shown in Figure 1.

Currently, as the video summarization problem appeals to many researchers, many state-of-the-art approaches have been presented to solve it [3, 24, 25, 26]. Generally, they treat video summarization as a sequence-to-sequence learning problem. Since RNNs and their variants LSTM and GRU are very efficient at modelling long-term dependencies under encoder-decoder architectures, many works in machine translation, image/video captioning, and reading comprehension adopt these technologies. In this paper, GRU modelling is adopted in order to consider both the isolated segment and its role in the whole video.

Video summarization, as one of the comprehensive video understanding tasks, is extremely difficult and requires a large amount of data in a deep learning architecture. However, collecting summarization labels is time-consuming and labor-intensive, resulting in insufficient datasets. Since supervised learning can easily overfit on small data, many methods have been explored to alleviate this issue; many state-of-the-art video summarization works address this problem with unsupervised learning, semi-supervised learning, or multi-task learning [25, 26, 28]. We explore self-supervised learning on video sequence modelling and joint training with segment importance score prediction. Another way to confront overfitting is to shuffle information flows, which potentially improves the model's generalization ability. The contributions of our work are:

[1] We unify various inputs on different semantic levels in one framework by formulating video summarization as a recommender problem.

[2] We develop an algorithm that models both independent segments and the segment sequence (whole video).

[3] We extend the summarization framework with self-supervised learning and data augmentation to deal with the lack of labelled data.

2 Related Work

Video classification, as a fundamental task for video understanding, has been studied extensively. Many high-quality datasets [1, 10, 11, 16] have been published, which drives the research to a higher level. Supervised learning brings the performance of algorithms close to human performance on large-scale video classification tasks [4, 7, 9, 18, 19, 20, 21].

A significant number of deep learning based frameworks have been explored recently for video summarization [3, 22, 24, 26, 28]. K. Zhang et al. creatively applied LSTMs to supervised video sequence labelling to model video temporal information with good performance [24]. J. Fajtl et al. introduced self-attention instead of RNNs to improve computational efficiency [3]. K. Zhou et al. showed that fully unsupervised learning can outperform many supervised methods by considering diversity and representativeness in a reinforcement learning based framework [28]. Y. Zhang et al. introduced an adversarial loss to video summarization, learning a dilated temporal relational generator and a discriminator with a three-player loss [26].

Researchers believe that learning good visual representations can help deep neural networks improve both fitting and generalization ability, especially in classification tasks [5, 12, 14, 17]. Some video summarization works adopt unsupervised learning as an auxiliary task to improve the supervised learning system's performance. K. Zhang et al. used a "retrospective encoder" that embeds the predicted summary and the original video into the same abstract semantic space with closer distance; this semi-supervised setting helps increase the F-score on the TVSum dataset from 63.9% to 65.2% [25]. K. Zhou et al., on the other hand, extended their reinforcement learning based framework to a semi-supervised style, elevating performance on both SumMe and TVSum compared to purely supervised or unsupervised methods [28].

Regarding the subjectivity of the video summarization task, Kanehira et al. provided a solution by building a summary that depends on the particular aspect of a video the viewer focuses on [8]. J. Lee et al. from Google published a content-only video recommendation system, which we regard as work close to video summarization [13].

3 Approach

In this section, we detail our method for the CoView 2019 comprehensive video understanding challenge track. First, we briefly introduce our work on action and scene recognition in untrimmed video. Second, we formalize the summarization problem and present our solution based on a deep neural network. Third, we discuss some important techniques for preventing the DNN from early-stage overfitting.

3.1 Action and Scene Recognition in Untrimmed Video

We first adopt I3D with non-local blocks [2, 21] for video classification in this subtask, using a ResNet-101 backbone. We tried two ways of training:

1. Action and scene recognition share the same backbone with two SoftMax loss branches, one for each classification task.

2. We train one independent model each for action and scene recognition.

Since the training data comes from only 1000 videos, the scene data for training is too small and lacks variation for recognition. There are many public image datasets for scene classification, such as Places [27] and SUN [23]. We assume the video scene classification task can be handled as image classification without much information loss, which lets us make full use of public datasets to potentially improve the model's robustness. We use average pooling over the per-image classification predictions to obtain the final video-level results. To capture the advantages of both the video-based and image-based models, we ensemble the output scores of the two models with logistic regression.
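As a concrete illustration, the sketch below shows one way such an ensemble could be wired up, assuming per-class probability vectors from both models on the validation split; the function names and feature layout are illustrative, not the authors' code.

```python
# Illustrative sketch: average per-frame scene probabilities from the image
# model into a video-level vector, then ensemble the image-based and
# video-based score vectors with logistic regression on the validation split.
import numpy as np
from sklearn.linear_model import LogisticRegression

def video_level_image_scores(frame_probs: np.ndarray) -> np.ndarray:
    """frame_probs: (T, C) per-frame class probabilities -> (C,) video-level vector."""
    return frame_probs.mean(axis=0)

def fit_score_ensemble(image_scores, video_scores, labels):
    """image_scores, video_scores: (N, C) class-probability matrices from the two models.
    labels: (N,) ground-truth scene ids from the validation split."""
    features = np.concatenate([image_scores, video_scores], axis=1)  # (N, 2C)
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features, labels)
    return clf
```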

3.2 Video Summarization Framework

We take the video summarization task as a content-based recommender problem. Let $X_m = [x_{m,1}, \dots, x_{m,N_m}]^\top \in \mathbb{R}^{N_m \times d}$ be the feature matrix for a video $m$ with $N_m$ segments in dimension $d$.

Figure 2: The Video Summarization Network structure. It contains three subnetworks including: SegNet, VideoNet, and HighlightNet.

For each segment we have a mean importance score $y_{m,i}$ as users' feedback, which can be written as

$$Y_m = [y_{m,1}, \dots, y_{m,N_m}]^\top.$$

Our goal is to select the segments with the top-$k$ highest predicted importance scores from the video as the summarization set $S_m$. So the problem can be defined as finding a ranking function $f$ that predicts the segment importance score in the video segment sequence,

$$\hat{y}_{m,i} = f(x_{m,i}).$$

A loss function can be defined by the mean-square error (MSE),

$$L = \frac{1}{H} \sum_{m=1}^{H} \frac{1}{N_m} \sum_{i=1}^{N_m} \big( f(x_{m,i}) - y_{m,i} \big)^2,$$

where $H$ is the number of videos in the training set. Our algorithm finds the optimal function $f^{*}$ that minimizes the overall loss,

$$f^{*} = \arg\min_{f \in \mathcal{F}} L,$$

in the prediction model space $\mathcal{F}$.

Since $x_{m,i}$ lacks information from the whole video sequence as the input of $f$, we set $\tilde{X}_m$ defined by

$$\tilde{X}_m = [\, X_m, V_m \,],$$

where $[\cdot\,,\cdot]$ is the operator that concatenates each row of $X_m$ to $V_m$, and $V_m$ is a matrix that assigns a $d_v$-dimensional whole-video feature to each segment, which can also be taken as a learnable function $g(X_m)$. In this work, we set the same value for each row of $V_m$.

We can also obtain sequence descriptors for one segment, such as image frames or audio frames, so a segment is described by a frame feature sequence

$$U_{m,i} = [u_{m,i,1}, \dots, u_{m,i,T}].$$

Another learnable function $\psi$ is used to map the frame sequence features from the frame feature space to the segment feature space. We combine the frame-based feature and the segment-level feature $s_{m,i}$ by a learnable function $\phi$ into the final segment-level feature

$$x_{m,i} = \phi\big(\psi(U_{m,i}),\, s_{m,i}\big).$$

We can optimize $f$, $g$, $\psi$ and $\phi$ in one framework by minimizing

$$L = \frac{1}{H} \sum_{m=1}^{H} \frac{1}{N_m} \sum_{i=1}^{N_m} \Big( f\big(\big[\,\phi(\psi(U_{m,i}), s_{m,i}),\; g(X_m)\,\big]\big) - y_{m,i} \Big)^2.$$

We design a video summarization network which consists of three subnetworks: SegNet for the functions $\psi$ and $\phi$, VideoNet for $g$, and HighlightNet for $f$, as shown in Figure 2.
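The formulation above maps directly onto a training step. The following PyTorch-style sketch is illustrative only (module names, signatures and tensor shapes are assumptions, not the authors' code): SegNet plays the role of $\psi$ and $\phi$, VideoNet plays $g$, HighlightNet plays $f$, and the MSE loss ties them together.

```python
# Hedged sketch of one training step for a single video under the formulation above.
import torch
import torch.nn as nn

def summarization_loss(seg_net, video_net, highlight_net,
                       frame_feats, seg_feats, scores):
    """frame_feats: (N, T, d_u) frame features per segment
    seg_feats:   (N, d_s) segment-level features
    scores:      (N,) mean importance scores for one video."""
    x = seg_net(frame_feats, seg_feats)           # (N, d)   -> phi(psi(U), s)
    v = video_net(x.unsqueeze(0)).squeeze(0)      # (d_v,)   -> g(X)
    v = v.expand(x.size(0), -1)                   # same video vector for every segment
    y_hat = highlight_net(torch.cat([x, v], dim=1)).squeeze(-1)  # f([x, v])
    return nn.functional.mse_loss(y_hat, scores)
```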

As we utilize ImageNet features for each sampled video frame, SegNet combines the frame feature sequence into a single frame-based vector and then fuses the frame-based feature with the segment-level feature. SegNet is designed to process either 2D or 3D frame sequences: for 2D features, we adopt temporal convolution and pooling to squeeze the temporal dimension, and for 3D features, we use three 3D convolution blocks. The 2D convolution branch adopts a "bottleneck" structure for the spatial fully connected layers (Figure 3).

Figure 3: The SegNet structure details. The network is designed to process either 2D or 3D frame sequences. Only one output (output1 or output2) is taken as the segment embedding.
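A hedged sketch of the 2D branch of SegNet described above; the layer widths and feature dimensions are assumptions, since the exact sizes are only given in the figure.

```python
# Illustrative SegNet 2D branch: temporal convolution plus pooling squeezes the
# frame axis, and a "bottleneck" FC block fuses the frame-based vector with the
# segment-level feature.
import torch
import torch.nn as nn

class SegNet2D(nn.Module):
    def __init__(self, frame_dim=1024, seg_dim=2048, out_dim=256):
        super().__init__()
        self.temporal = nn.Sequential(
            nn.Conv1d(frame_dim, 512, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1))                      # squeeze the temporal axis
        self.bottleneck = nn.Sequential(                  # fuse frame + segment features
            nn.Linear(512 + seg_dim, 128), nn.ReLU(),
            nn.Linear(128, out_dim))

    def forward(self, frames, seg_feat):
        # frames: (N, T, frame_dim), seg_feat: (N, seg_dim)
        h = self.temporal(frames.transpose(1, 2)).squeeze(-1)    # (N, 512)
        return self.bottleneck(torch.cat([h, seg_feat], dim=1))  # (N, out_dim)
```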

For each long video, the segment feature sequence generated by SegNet is taken as input to VideoNet for video-level modelling. VideoNet uses a Bi-GRU for sequence encoding. The final GRU hidden state, a fixed-length context vector, is taken as the output of VideoNet, representing the video-level feature.

Each segment-level feature is concatenated with the video-level feature of the same long video to form the final segment representation. This representation contains not only the independent video segment information but also the whole video sequence information, and it is passed to the highlight subnetwork to predict the segment importance score. Figure 4 shows the structure of the highlight subnetwork.

Figure 4: The HighlightNet structure details. The network is designed to predict the segment importance score from the fused features.
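For completeness, a minimal sketch of VideoNet and HighlightNet as described above; hidden sizes, layer counts and the dropout rate are assumptions rather than values taken from the figures.

```python
# Illustrative VideoNet (Bi-GRU whose final hidden state is the video vector)
# and HighlightNet (small MLP over the fused segment + video representation).
import torch
import torch.nn as nn

class VideoNet(nn.Module):
    def __init__(self, seg_dim=256, hidden=128):
        super().__init__()
        self.gru = nn.GRU(seg_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, seg_seq):                      # seg_seq: (1, N, seg_dim)
        _, h_n = self.gru(seg_seq)                   # h_n: (2, 1, hidden)
        return torch.cat([h_n[0], h_n[1]], dim=-1)   # (1, 2*hidden) video-level vector

class HighlightNet(nn.Module):
    def __init__(self, in_dim=256 + 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(),
                                 nn.Dropout(0.5), nn.Linear(128, 1))

    def forward(self, fused):                        # fused: (N, seg_dim + video_dim)
        return self.mlp(fused)                       # (N, 1) importance scores
```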

3.3 Multi-task Learning and Data Augmentation for Video Summarization Network

We consider an auxiliary self-supervised task to better model the video sequence. Some segments are selected from each video by a fixed proportion before being input to the network. We shuffle the selected segments and train the network to distinguish the odd-position segments, as shown in Figure 5. This operation assumes that a good video sequence encoder has the ability to model the correct segment order. We control the difficulty of the task by only indicating the odd-position segments rather than sorting the shuffled sequence back to the right order. Parameters $\rho$ and $\lambda$ are used to adjust the shuffled segments' proportion and the weight of the self-supervised task in multi-task learning. The final loss can be defined by

$$L_{total} = L_{MSE} + \lambda L_{self},$$

where $L_{self}$ is the odd-position classification loss.

The advantages of applying multi-task learning are:

1. Learning several tasks simultaneously can suppress early-stage overfitting, by sharing the same representations.

2. The auxiliary task is helpful in providing more useful information.

3. Our method implicitly performs data augmentation since we shuffle the input video sequence, which may improve the robustness of the algorithm.

4. This method utilizes unlabelled data for video sequence modelling.

Figure 5: Self-taught learning for odd-position labelling. The input segment sequence is shuffled randomly by ratio $\rho$. The video encoder learns to find the odd-position segments in the sequence.
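One possible realization of this odd-position labelling, assuming the shuffle is applied to segment embeddings and the odd-position head is trained with binary cross-entropy; both are assumptions, since the paper does not fix these details.

```python
# Hedged sketch: shuffle a fraction rho of segment positions, label the segments
# that moved with 1, and add a weighted BCE term to the importance-score MSE.
import torch
import torch.nn.functional as F

def shuffle_segments(seg_feats, rho=0.3):
    """seg_feats: (N, d). Returns shuffled features and 0/1 odd-position labels."""
    n = seg_feats.size(0)
    k = max(2, int(n * rho))
    idx = torch.randperm(n)[:k]                 # positions to disturb
    perm = idx[torch.randperm(k)]               # permute those positions among themselves
    shuffled = seg_feats.clone()
    shuffled[idx] = seg_feats[perm]
    labels = torch.zeros(n)
    labels[idx[idx != perm]] = 1.0              # segments that actually moved
    return shuffled, labels

def multitask_loss(score_pred, score_gt, odd_logits, odd_labels, lam=0.5):
    # L_total = L_MSE + lambda * L_self
    return F.mse_loss(score_pred, score_gt) + lam * F.binary_cross_entropy_with_logits(
        odd_logits, odd_labels)
```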

A data augmentation method is proposed which chooses only a portion of the information from each segment when modelling the whole video each time. After modelling all portions of the segments, we average the embeddings from the different portions into one vector (Figure 6).

Figure 6: Data augmentation method. VideoNet only models a portion of each segment into the video embedding. Multiple embeddings are combined by average pooling.
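A sketch of this augmentation under one interpretation of a "portion": each pass masks all but one slice of every segment embedding before VideoNet encodes the video, and the resulting embeddings are averaged. The slicing scheme is an assumption.

```python
# Hedged sketch: encode the video several times, each time exposing only one
# slice of every segment embedding, then average-pool the video embeddings.
import torch

def augmented_video_embedding(video_net, seg_feats, num_portions=4):
    """seg_feats: (N, d) segment embeddings; d must be divisible by num_portions."""
    n, d = seg_feats.shape
    chunk = d // num_portions
    embeds = []
    for p in range(num_portions):
        part = torch.zeros_like(seg_feats)
        part[:, p * chunk:(p + 1) * chunk] = seg_feats[:, p * chunk:(p + 1) * chunk]
        embeds.append(video_net(part.unsqueeze(0)))      # encode one portion at a time
    return torch.stack(embeds, dim=0).mean(dim=0)        # average-pool the embeddings
```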

4 Experiments

In this section, we present comparison experiments on the CoView 2019 dataset. The dataset consists of 1200 videos for training and 300 for testing, sampled from the YouTube-8M, Dense-Captioning, and video summarization datasets. Every video is segmented into a set of 5-second-long segments, and 20 users were asked to annotate an importance score and one of 99 action / 79 scene labels for each segment. The average of the importance scores and the most voted action/scene labels are provided as segment-level ground truth. Before any experiments, we randomly split the videos into training (1000) and validation (200) sets based on the original videos rather than the 5-second segments.

4.1 Scene and Action Classification in Untrimmed Video

For the scene/action classification task, top-5 accuracies on the validation set are shown in Table 1: a. joint training of I3D+non-local attention with two SoftMax branches (JT); b. independent I3D+non-local models (IT); c. independent I3D+non-local model with external action classification data (AED); d. ResNet-50 with external scene image classification data (SED).

Method Scene Top-5 Action Top-5
a (JT) 82.76 81.72
b (IT) 86.92 87.77
c (AED) - 84.86
d (SED) 82.06 -
Table 1: Top-5 accuracy of different models on validation.

We train I3D+non-local from a model pre-trained on Kinetics, and ResNet-50 pre-trained on ImageNet. The results indicate: 1. Although the scene and action classification tasks may have an inner connection, both joint training and ensembling failed to improve the results. 2. The pure 2D convolution-based network is less accurate than the 3D convolution-based network on both scene and action recognition. 3. External data did not improve classification accuracy on the validation set, but we still use it for potential generalization enhancement.

4.2 Video Summarization

We investigate our summarization network with different parameter scales, visual and visual-audio feature inputs, single-task and multi-task learning, and with and without data augmentation.

Evaluation Protocol.

The video summarization evaluation metric compares the sum of importance scores of the selected segments with that of the ground-truth summary segments. The importance score metric is defined as

$$\text{Score} = \frac{1}{N} \sum_{k=1}^{N} \frac{\sum_{i=1}^{M} \hat{s}_{k,i}}{\sum_{i=1}^{M} s_{k,i}},$$

where $N$ is the number of test videos, $M$ is the number of summarized segments, $s_{k,i}$ is the importance score of the $i$-th segment of the ground-truth summary of the $k$-th video, and $\hat{s}_{k,i}$ is the importance score of the $i$-th segment of the submitted summary for the $k$-th video. The importance scores are shared between the ground truth and the submitted summary. $M$ is set to 6 for all videos, and the top-$M$ importance segments form the ground-truth summary. As a baseline, we randomly choose segments 10 times; the mean summary score is 74.92%.
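A small sketch of this protocol as reconstructed above; the exact normalization used by the challenge may differ.

```python
# Illustrative summary-score computation: per video, divide the ground-truth
# importance summed over the submitted top-M segments by the sum over the true
# top-M segments, then average over videos.
import numpy as np

def summary_score(gt_scores, submitted_idx, M=6):
    """gt_scores: list of per-video importance arrays; submitted_idx: list of
    per-video index lists (length M) for the submitted summaries."""
    ratios = []
    for scores, idx in zip(gt_scores, submitted_idx):
        gt_top = np.sort(scores)[-M:].sum()             # ground-truth top-M segments
        ratios.append(scores[list(idx)].sum() / gt_top)
    return float(np.mean(ratios)) * 100.0               # reported as a percentage
```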

Embedding Dimension. We explore different embedding dimensions for the network by constraining the SegNet and VideoNet outputs to the same length. For each segment, we sample 16 frames for the image-based extractor and 32 frames for the video-based extractor. We use GoogleNet trained on ImageNet to extract image-based features and I3D+non-local blocks trained on Kinetics to extract video-based features. Table 2 shows the results for feature lengths from 512 down to 64; a feature length of 256 is slightly better.

Feature Length SummaryScore
Dim = 512 81.47
Dim = 256 81.81
Dim = 128 81.49
Dim = 64 81.19
Table 2: Summary score on different embedding dimension.

Semantic Features Combination.

Image-based features are extracted by image classifiers. We select three types of image classifier: 1. ResNet-50 trained on the CoView2019 scene classification task (R50_s); 2. ResNet-152 trained on ImageNet (R152_i); 3. GoogleNet trained on ImageNet (G_i). Video-based features are extracted by video classifiers. We choose two video classifiers: 1. I3D+non-local trained on Kinetics (INK); 2. I3D+non-local trained on the CoView2019 action recognition task (INA). Table 3 shows different visual feature combinations with feature length 256. Our model did not benefit much from the action and scene recognition tasks.

Feature Type SummaryScore
G_i+INK 81.81
G_i+INA 81.06
R50_s+INA 80.83
R152_i+INA 80.99
R50_s+INK 81.45
R152_i+INK 81.45
Table 3: Summary score with different visual semantic feature combination.

Multi-Task Learning. Since the self-taught learning increases the learning difficulty, we adopt a 3-layer bi-GRU in VideoNet and a feature length of 512 with R50_s+INK features; dropout and the bottleneck structure are removed from SegNet. The self-taught task is first trained for two days on CoView2019 videos with shuffle ratio $\rho$. We use the trained model, which reaches 96.64% accuracy on the self-taught task and 78.6% recall on odd-position segments, as the pre-trained model in the multi-task setting. In joint training, we set the shuffle ratio $\rho$ and the task weight $\lambda$ when evaluating on validation. Table 4 compares three models: a. the baseline supervised learning without a pre-trained model (sup_no_prt); b. supervised learning with the self-taught pre-trained model (sup_with_prt); c. multi-task learning with the self-taught pre-trained model (multi_with_prt). The results show that the multi-task setting outperforms the baseline model by a small margin.

Method SummaryScore
a. (sup_no_prt) 80.69
b. (sup_with_prt) 80.98
c. (multi_with_prt) 81.64

Table 4: Summary score with Multi-Task learning.

Data augmentation. The basic network setting is the same as R50_s+INK in the "Semantic Features Combination" experiment; we obtain a 0.29% improvement, from 81.45% to 81.74%.

Audio feature. We extract MFCC and chromagram audio features for each segment using the Python audio processing package LibROSA. We concatenate the audio features into 5450-dimensional segment-level features before inputting them to SegNet. The audio fusion model builds on the multi-task setting in the "Multi-Task Learning" experiment. The result decreases from 81.64% to 80.42%, which may be due to the increased model complexity caused by the larger feature dimension without adding important information.
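For reference, a minimal LibROSA-based sketch of this extraction; the sampling rate, number of MFCC coefficients, and flattening scheme are assumptions, since only the final 5450-dimension size is stated.

```python
# Hedged sketch: MFCC and chromagram per 5-second segment via LibROSA,
# flattened and concatenated into one segment-level audio vector.
import numpy as np
import librosa

def segment_audio_features(wav_path, sr=22050):
    y, sr = librosa.load(wav_path, sr=sr, duration=5.0)   # one 5-second segment
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)    # (20, frames)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)      # (12, frames)
    return np.concatenate([mfcc.flatten(), chroma.flatten()])
```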

Optical flow. Optical flow, a low-level feature describing video motion, has been shown to be complementary to RGB in action recognition. Optical flow sequences are obtained with Gunnar Farneback's algorithm on 16 frames per segment. We modified the SegNet in the multi-task setting to 4 layers of 3D convolution to process the optical flow features along the temporal and spatial dimensions. As with the audio features, this modification decreases the summary score from 81.64% to 80.78%.
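A sketch of the flow extraction step with OpenCV's Farneback implementation; the Farneback parameters shown are common defaults, not values from the paper.

```python
# Hedged sketch: 16 dense flow fields per segment with cv2.calcOpticalFlowFarneback;
# each flow field has 2 channels (dx, dy).
import cv2
import numpy as np

def segment_optical_flow(gray_frames):
    """gray_frames: list of 17 grayscale frames (H, W) -> (16, H, W, 2) flow fields."""
    flows = []
    for prev, nxt in zip(gray_frames[:-1], gray_frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            pyr_scale=0.5, levels=3, winsize=15,
                                            iterations=3, poly_n=5, poly_sigma=1.2,
                                            flags=0)
        flows.append(flow)
    return np.stack(flows)
```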

Model ensemble.

We ensemble five models by linear regression: the base 256-feature-length model (81.81%), the audio feature model (80.42%), the optical flow model (80.78%), the multi-task model (81.64%), and the data augmentation model (81.74%). We then perform model selection according to the linear regression weights, which led us to choose the ensemble of the base 256-length model and the data augmentation model (81.86% on validation) as the final submission.
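A sketch of this ensemble-then-select step, assuming the regression is fit from per-segment predictions to ground-truth importance scores on the validation split; the inspection of weights for model selection is the stated idea, the code layout is illustrative.

```python
# Hedged sketch: fit linear regression over the five models' predicted segment
# scores, then inspect the learned weights to decide which models to keep.
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_ensemble(model_preds, gt_scores):
    """model_preds: (num_segments, num_models) predicted importance scores;
    gt_scores: (num_segments,) ground-truth mean importance scores."""
    reg = LinearRegression().fit(model_preds, gt_scores)
    print("ensemble weights:", reg.coef_)   # near-zero weights suggest dropping a model
    return reg
```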

5 Conclusion and Future Work

In this paper, we propose a scalable deep neural network for video summarization in a content-based recommender formulation, which predicts segment importance scores by considering both segment-level and video-level information. Our work shows that data augmentation and multi-task learning are helpful in dealing with the limited-data problem. To better understand the video content, we also perform action and scene recognition in untrimmed video with state-of-the-art video classification algorithms. We experiment with combinations of high-level visual semantic features, audio features and optical flow, and conclude that visual semantic features play the most important role in this summarization task.

There are several directions for further improving comprehensive video understanding. First, we did not benefit from the connection between action and scene in either the recognition or the summarization task, which leaves room to better utilize this prior knowledge. Second, as we re-formalize video summarization in a recommender framework, state-of-the-art recommendation technologies such as collaborative filtering, factorization machines, and Wide & Deep learning can be introduced. Last but not least, although CoView2019 provides, to our knowledge, the largest public video summarization dataset, it is still small at deep learning scale. We need more data, such as users' actions when browsing video websites and video language descriptions, along with larger-scale datasets, to accomplish such a complex task.

References

  • [1] S. Abu-El-Haija, N. Kothari, J. Lee, P. Natsev, G. Toderici, B. Varadarajan, and S. Vijayanarasimhan (2016) Youtube-8m: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675. Cited by: §2.
  • [2] J. Carreira and A. Zisserman (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308. Cited by: §3.1.
  • [3] J. Fajtl, H. S. Sokeh, V. Argyriou, D. Monekosso, and P. Remagnino (2018) Summarizing videos with attention. In ACCV Workshops, Cited by: §1, §1, §2.
  • [4] C. Feichtenhofer, H. Fan, J. Malik, and K. He (2018) Slowfast networks for video recognition. arXiv preprint arXiv:1812.03982. Cited by: §2.
  • [5] B. Fernando, H. Bilen, E. Gavves, and S. Gould (2017) Self-supervised video representation learning with odd-one-out networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3636–3645. Cited by: §2.
  • [6] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool (2014) Creating summaries from user videos. In European conference on computer vision, pp. 505–520. Cited by: §1.
  • [7] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.
  • [8] A. Kanehira, L. Van Gool, Y. Ushiku, and T. Harada (2018) Viewpoint-aware video summarization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7435–7444. Cited by: §2.
  • [9] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §2.
  • [10] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §2.
  • [11] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre (2011) HMDB: a large video database for human motion recognition. In 2011 International Conference on Computer Vision, pp. 2556–2563. Cited by: §2.
  • [12] H. Lee, J. Huang, M. Singh, and M. Yang (2017) Unsupervised representation learning by sorting sequences. In Proceedings of the IEEE International Conference on Computer Vision, pp. 667–676. Cited by: §2.
  • [13] J. Lee and S. Abu-El-Haija (2017) Large-scale content-only video recommendation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 987–995. Cited by: §1, §2.
  • [14] I. Misra, C. L. Zitnick, and M. Hebert (2016) Shuffle and learn: unsupervised learning using temporal order verification. In European Conference on Computer Vision, pp. 527–544. Cited by: §2.
  • [15] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes (2015) Tvsum: summarizing web videos using titles. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5179–5187. Cited by: §1.
  • [16] K. Soomro, A. R. Zamir, and M. Shah (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. Cited by: §2.
  • [17] N. Srivastava, E. Mansimov, and R. Salakhudinov (2015) Unsupervised learning of video representations using lstms. In International conference on machine learning, pp. 843–852. Cited by: §2.
  • [18] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich (2015) Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9. Cited by: §2.
  • [19] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri (2018) A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 6450–6459. Cited by: §2.
  • [20] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool (2016) Temporal segment networks: towards good practices for deep action recognition. In European conference on computer vision, pp. 20–36. Cited by: §2.
  • [21] X. Wang, R. Girshick, A. Gupta, and K. He (2018) Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803. Cited by: §2, §3.1.
  • [22] H. Wei, B. Ni, Y. Yan, H. Yu, X. Yang, and C. Yao (2018) Video summarization via semantic attended networks. In Thirty-Second AAAI Conference on Artificial Intelligence. Cited by: §2.
  • [23] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba (2010) Sun database: large-scale scene recognition from abbey to zoo. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3485–3492. Cited by: §3.1.
  • [24] K. Zhang, W. Chao, F. Sha, and K. Grauman (2016) Video summarization with long short-term memory. In European conference on computer vision, pp. 766–782. Cited by: §1, §2.
  • [25] K. Zhang, K. Grauman, and F. Sha (2018) Retrospective encoders for video summarization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 383–399. Cited by: §1, §1, §2.
  • [26] Y. Zhang, M. Kampffmeyer, X. Zhao, and M. Tan (2019) Dtr-gan: dilated temporal relational adversarial network for video summarization. In Proceedings of the ACM Turing Celebration Conference-China, pp. 89. Cited by: §1, §1, §2.
  • [27] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §3.1.
  • [28] K. Zhou, Y. Qiao, and T. Xiang (2018) Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §2, §2.