Cycle-SUM: Cycle-consistent Adversarial LSTM Networks for Unsupervised Video Summarization

04/17/2019 · Li Yuan, et al. · National University of Singapore

In this paper, we present a novel unsupervised video summarization model that requires no manual annotation. The proposed model, termed Cycle-SUM, adopts a new cycle-consistent adversarial LSTM architecture that effectively maximizes the information preservation and compactness of the summary video. It consists of a frame selector and a cycle-consistent learning based evaluator. The selector is a bidirectional LSTM network that learns video representations embedding the long-range relationships among video frames. The evaluator defines a learnable information-preserving metric between the original and summary videos and "supervises" the selector to identify the most informative frames to form the summary video. In particular, the evaluator is composed of two generative adversarial networks (GANs): the forward GAN learns to reconstruct the original video from the summary video, while the backward GAN learns to invert this process. The consistency between the outputs of such cycle learning is adopted as the information-preserving metric for video summarization. We demonstrate the close relation between mutual information maximization and this cycle learning procedure. Experiments on two video summarization benchmark datasets validate the state-of-the-art performance and superiority of the Cycle-SUM model over previous baselines.

Introduction

Figure 1: Overview of the Cycle-SUM model. The summary video is selected from the input original video by the selector. To optimize the selector, a cycle-consistent adversarial LSTM evaluator is introduced to evaluate the summary quality via cycle-consistent learning, which measures the mutual information between the original and summary videos.

With the explosion of video data, video summarization technologies [Ma et al.2002, Pritch et al.2007, Lu and Grauman2013] become increasingly attractive for efficiently browsing, managing and retrieving video content. With such techniques, a long video can be shortened into different forms, e.g. key shots [Gygli et al.2014], key frames [Kim, Sigal, and Xing2014] and key objects [Meng et al.2016]. Here, we aim at selecting key frames for summarizing a video.

Video summarization is usually formulated as a structure prediction problem [Zhang et al.2016, Mahasseni, Lam, and Todorovic2017]. The model takes as input a sequence of video frames, and outputs a subset of original video frames containing critical information. Ideally, the summary video should keep all key information of the original video with minimal redundancy. Summary completeness and compactness are expected for good video summarization.

Existing approaches can be roughly grouped into supervised and unsupervised ones. Many supervised approaches [Zhang et al.2016, Gygli, Grabner, and Van Gool2015] utilize human-annotated summaries as ground truth to train a model. However, sufficient human-annotated video summarization examples are not always available and are expensive to collect. Thus, unsupervised approaches that require no human intervention become increasingly attractive due to their low cost. For these methods, it is critical to design a proper summary quality metric. For instance, [Mahasseni, Lam, and Todorovic2017] adopt the GAN [Goodfellow et al.2014] to measure the similarity between the summary and original videos and improve the summarization model by optimizing the induced objective, based on the idea that a good summary video should be able to faithfully reconstruct the original input video. However, this approach only considers one-direction reconstruction, so some significant frames may dominate the quality measure, leading to severe information loss in the summary video.

In this paper, we propose a novel cycle-consistent unsupervised model, motivated by maximizing the mutual information between the summary video and the original video. Our model is developed with a new cycle-consistent adversarial learning objective to pursue optimal information preservation for the summary video, partially inspired by the cycle generative adversarial network [Zhu et al.2017, Yi et al.2017]. Moreover, to effectively capture the short-range and long-range dependencies among sequential frames [Zhang et al.2016], we propose a VAE-based LSTM network as the backbone model for learning video representations. We name this cycle-consistent adversarial LSTM network for video summarization Cycle-SUM.

Cycle-SUM performs original and summary video reconstruction in a cyclic manner, and leverages the consistency between the original/summary video and its cycle reconstruction to "supervise" the video summarization. Such a cycle-consistent objective guarantees summary completeness without additional supervision. Compared with one-direction reconstruction (i.e., from summary video to original video) [Zhu et al.2017, Yi et al.2017], the bidirectional model performs a reversed reconstruction and a cycle-consistent reconstruction to relieve information loss.

Structurally, the Cycle-SUM model consists of two components: a selector that predicts an importance score for each frame and selects the frames with high importance scores to form the summary video, and a cycle-consistent evaluator that evaluates the quality of selected frames through cycle reconstruction. To achieve effective information preservation, the evaluator employs two VAE-based generators and two discriminators to evaluate the cycle-consistent loss. The forward generator and discriminator are responsible for reconstructing the original video from the summary video, and the backward counterparts perform the backward reconstruction from the original to the summary video. Both reconstructions are performed in the learned embedding feature space. The discriminator is trained to distinguish the summary video from the original one. If the summary video misses some informative frames, the discriminator can tell the difference from the original and thus serves as a good evaluator that encourages the selector to pick important frames.

An illustration of the proposed framework is given in Fig. 1. The summary video is a subset of the original video frames, selected by the selector based on the predicted frame-wise importance scores. The original video is reconstructed from the summary video, and then back again. Given a distance between the original and summary videos in the deep feature space, the Cycle-SUM model optimizes the selector such that this distance is minimized over training examples. The closed loop of Cycle-SUM is aimed at 1) assisting the Bi-LSTM selector to select a subset of frames from the original video, and 2) keeping a suitable distance between the summary and original videos in the deep feature space to improve summary completeness and reduce redundancy.

Our contributions are three-fold. 1) We introduce a new unsupervised video summarization model that does not require any manual annotation of video frame importance yet achieves outstanding performance. 2) We propose a novel cycle-consistent adversarial learning model. Compared with one-direction reconstruction based models, our model is superior in preserving information and facilitating the learning procedure. 3) We theoretically derive the relation between maximizing the mutual information of the summary and original videos and the proposed cycle-consistent adversarial learning model. To the best of our knowledge, this work is the first to transparently reveal how to effectively maximize mutual information by cycle adversarial learning.

Related Work

Supervised video summarization approaches leverage videos with human annotations of frame importance to train models. For example, Gong et al. formulate video summarization as a supervised subset selection problem and propose a sequential determinantal point process (seqDPP) based model to sample a representative and diverse subset from training data [Gong et al.2014]. To relieve the human annotation burden and reduce cost, unsupervised approaches, which have received increasing attention, generally design different criteria to produce an importance ranking over frames for selection. For example, [Wang et al.2012, Potapov et al.2014] propose to select frames according to their content relevance. [Mei et al.2015, Cong, Yuan, and Luo2012] design unsupervised criteria by trying to reconstruct the original video from selected key frames and key shots under the dictionary learning framework. Clustering-based models [De Avila et al.2011, Kuanar, Panda, and Chowdhury2013] and attention-based models [Ma et al.2002, Ejaz, Mehmood, and Baik2013] are also developed to select key frames.

Recently, deep learning models have been developed for both supervised and unsupervised video summarization, in which LSTM is usually taken as the video representation model. For example, [Zhang et al.2016] treat video summarization as a sequential prediction problem inspired by speech recognition. They present a bidirectional LSTM architecture to learn representations of frame sequences of variable length and output a binary vector indicating which frames to select. Our proposed Cycle-SUM model also adopts LSTM as the backbone for learning long-range dependencies between video frames.

Figure 2: Demonstration of the Cycle-SUM architecture. Red parts denote the components of our Cycle-SUM model, while blue parts denote data processing. Cycle-SUM has two parts: the selector that selects frames and the cycle-consistent evaluator that "supervises" the selection. The frame-level features $x$ of the original video are extracted by a deep CNN. The selector takes $x$ as input and outputs the importance scores $s$. During training, the forward generator $G_f$ takes the summary features $x^s$ as input and reconstructs a sequence of features $\hat{x}$. The discriminator $D_f$ is trained to distinguish $\hat{x}$ and $x$. The backward generator $G_b$ takes $x$ as input and outputs $\hat{x}^s$; the discriminator $D_b$ likewise tries to distinguish between $\hat{x}^s$ and $x^s$. To achieve cycle consistency, the forward cycle $x^s \rightarrow \hat{x} \rightarrow \tilde{x}^s$ and the backward cycle $x \rightarrow \hat{x}^s \rightarrow \tilde{x}$ are implemented to encourage the information to be consistent between the original and summary videos.

Method

The proposed Cycle-SUM model formulates video summarization as a sequence-to-sequence learning problem, taking as input a sequence of video frames and outputting a sequence of frame-wise importance scores. The frames with high importance scores are selected to form a summary video. Throughout the paper, we use $V_o$ and $V_s$ to denote the original input video and the summary video respectively, and $x$ and $x^s$ to denote the frame-level features of $V_o$ and $V_s$ respectively. To train Cycle-SUM in an unsupervised manner, we develop a cycle-consistent learning method that maximizes the mutual information between $V_o$ and $V_s$.

Mutual Information Maximization via Cycle Learning

Video summarization is essentially aimed at extracting the video frames that contain the critical information of the original video. In this subsection, we explain how to derive our cycle-consistent learning objective from the desired objective of maximizing the mutual information between the summary video $V_s$ and the original video $V_o$.

Formally, the mutual information is defined as

where is the KL-divergence between two distributions. Then the objective of video summarization is to extract the summary video from to maximize their mutual information. The video summarization model should try to produce such that its conditional distribution gives the maximal mutual information with

. However, though it is easy to obtain empirical distribution estimation of original video

, it is difficult to obtain ground truth distribution of corresponding

in an unsupervised learning scenario. This makes one major challenge to unsupervised video summarization.

We propose a cycle-consistent learning objective to relieve this learning difficulty. We notice that

$I(V_o; V_s) \;=\; \mathbb{E}_{p(x^s)}\Big[D_{KL}\big(p(x\,|\,x^s)\,\|\,p(x)\big)\Big] \;=\; \mathbb{E}_{p(x)}\Big[D_{KL}\big(p(x^s\,|\,x)\,\|\,p(x^s)\big)\Big]. \qquad (1)$

The above computation of mutual information "anchors" at the original-video distribution $p(x)$, which can be faithfully estimated from data, and thus eases the procedure of learning the distribution of $x^s$ even in an unsupervised learning setting.
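As a quick sanity check (a standard identity, not specific to this paper), the first form in Eqn. (1) expands back to the definition of mutual information:

$\mathbb{E}_{p(x^s)}\Big[D_{KL}\big(p(x\,|\,x^s)\,\|\,p(x)\big)\Big] \;=\; \int p(x^s)\!\int p(x\,|\,x^s)\,\log\frac{p(x\,|\,x^s)}{p(x)}\,dx\,dx^s \;=\; \int\!\!\int p(x,x^s)\,\log\frac{p(x,x^s)}{p(x)\,p(x^s)}\,dx\,dx^s \;=\; I(V_o;V_s),$

and the second form follows by symmetry.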

To effectively model and optimize the above learning objective, we adopt the Fenchel conjugate to derive a variational bound that is easier to optimize. The Fenchel conjugate of a function $f$ is defined as $f^*(t) = \sup_{u}\{tu - f(u)\}$, or equivalently $f(u) = \sup_{t}\{tu - f^*(t)\}$.
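For the particular choice $f(u) = u\log u$ used below, which generates the KL-divergence as an $f$-divergence, the conjugate can be computed in closed form; this is a standard calculation, included here for completeness:

$f^*(t) = \sup_u\{tu - u\log u\}: \quad \frac{\partial}{\partial u}\big(tu - u\log u\big) = t - \log u - 1 = 0 \;\Rightarrow\; u = e^{\,t-1} \;\Rightarrow\; f^*(t) = e^{\,t-1}.$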

Thus, defining $f(u) = u\log u$, we have the following lower bound for the KL-divergence between two distributions $P$ and $Q$:

$D_{KL}(P\,\|\,Q) \;\ge\; \sup_{T\in\mathcal{T}}\Big(\mathbb{E}_{x\sim P}\big[T(x)\big] - \mathbb{E}_{x\sim Q}\big[f^*\big(T(x)\big)\big]\Big),$

where $f^*(t) = e^{\,t-1}$ and $\mathcal{T}$ is an arbitrary class of functions $T:\mathcal{X}\rightarrow\mathbb{R}$. The inequality is due to Jensen's inequality and to the fact that $\mathcal{T}$ is only a subset of all possible functions. Therefore, applying this bound to the first form in Eqn. (1), we have

$\mathbb{E}_{p(x^s)}\Big[D_{KL}\big(p(x\,|\,x^s)\,\|\,p(x)\big)\Big] \;\ge\; \sup_{T\in\mathcal{T}}\;\mathbb{E}_{x^s\in\mathcal{S}}\Big(\mathbb{E}_{x\sim p(x|x^s)}\big[T(x)\big] - \mathbb{E}_{x\sim p(x)}\big[f^*\big(T(x)\big)\big]\Big).$

Here $\mathcal{S}$ is the set of produced summary videos. We can use a generative model to estimate the conditional distribution $p(x\,|\,x^s)$. To this end, we follow the generative-adversarial approach [Goodfellow et al.2014] and use two neural networks, $G_f$ and $D_f$, to implement sampling and data transformation. Here $G_f$ is the forward generative model, taking the condition $x^s$ as input and outputting a sample $\hat{x}$ of the original video given the summary sample. $D_f$ is the forward discriminative model. We learn the generative model by finding a saddle point of the above objective function, where we minimize w.r.t. $G_f$ and maximize w.r.t. $D_f$:

$\min_{G_f}\;\max_{D_f}\;\; \mathbb{E}_{x\sim p(x)}\big[D_f(x)\big] \;-\; \mathbb{E}_{x^s\in\mathcal{S}}\Big[f^*\big(D_f(G_f(x^s))\big)\Big]. \qquad (2)$

The above objective is similar to that of GANs, but the generative model here is a conditional one.

Similarly, we can obtain the learning objective for the other KL-divergence in Eqn. (1), $\mathbb{E}_{p(x)}\big[D_{KL}\big(p(x^s\,|\,x)\,\|\,p(x^s)\big)\big]$, by solving

$\min_{G_b}\;\max_{D_b}\;\; \mathbb{E}_{x^s\in\mathcal{S}}\big[D_b(x^s)\big] \;-\; \mathbb{E}_{x\sim p(x)}\Big[f^*\big(D_b(G_b(x))\big)\Big]. \qquad (3)$

Substituting Eqn. (2) and Eqn. (3) into Eqn. (1) gives the following cycle learning objective to maximize the mutual information between the original and summary video:

$\min_{G_f, G_b}\;\max_{D_f, D_b}\;\; \mathbb{E}_{x\sim p(x)}\big[D_f(x)\big] - \mathbb{E}_{x^s\in\mathcal{S}}\Big[f^*\big(D_f(G_f(x^s))\big)\Big] \;+\; \mathbb{E}_{x^s\in\mathcal{S}}\big[D_b(x^s)\big] - \mathbb{E}_{x\sim p(x)}\Big[f^*\big(D_b(G_b(x))\big)\Big], \qquad (4)$

where we omit constant factors (such as the number of original frames arising from the empirical expectations). To relieve the difficulties brought by the unknown distribution $p(x^s)$, we use the following cycle-consistent constraint to further regularize the generative models and the cycle learning process:

$G_b\big(G_f(x^s)\big) \approx x^s, \qquad G_f\big(G_b(x)\big) \approx x.$

We refer to cycle learning with the above constraint as cycle-consistent learning.

Architecture

Based on the above derivations, we design the cycle-consistent adversarial model for video summarization (Cycle-SUM). The architecture of our Cycle-SUM model is illustrated in Fig. 2. The selector is a Bi-LSTM network trained to predict an importance score for every frame of the input video. The evaluator consists of two pairs of generators and discriminators: the forward generator $G_f$ and discriminator $D_f$ form the forward GAN, while the backward generator $G_b$ and discriminator $D_b$ form the backward GAN. The two generators are implemented as variational auto-encoder (VAE) LSTMs, which encode the frame features into a latent variable and then decode it back into frame features. The two discriminators are LSTM networks that learn to distinguish generated frame features from true ones. We use the LSTM architecture extensively here to comprehensively model the temporal information across video frames. Moreover, by adopting the joint VAE-GAN structure, video similarity can be measured more reliably through better high-level representations [Larsen et al.2015]. The cycle structure (forward GAN and backward GAN) converts between the original and summary videos in both directions, which minimizes information loss.

Given a video of $n$ frames, the first step is to extract deep frame features $x = \{x_1, \dots, x_n\}$ via a deep CNN model. Given these features, the selector predicts a sequence of importance scores $s = \{s_1, \dots, s_n\}$ indicating the importance level of the corresponding frames. During training, the frame features $x^s$ of the summary video are formed from $x$ according to the predicted scores $s$ (score-weighted features). At test time, Cycle-SUM outputs discretized importance scores $s_t \in \{0, 1\}$; the frames with importance score 1 are selected.

With $x$ and $x^s$, the evaluator performs cycle-consistent learning (see Fig. 2) to evaluate the quality of the summary video w.r.t. both completeness and compactness. Specifically, within the evaluator, the forward generator $G_f$ takes the current summary features $x^s$ as input and outputs a sequence of reconstructed features for the original video, namely $\hat{x}$. The paired discriminator $D_f$ then estimates the distribution divergence between the original and summary videos in the learned feature space. The backward generator $G_b$ and discriminator $D_b$ have a symmetrical network architecture and training procedure to the forward ones. In particular, $G_b$ takes the original video features $x$ as input and outputs $\hat{x}^s$ to reconstruct the summary video; the discriminator $D_b$ then tries to distinguish between $x^s$ and $\hat{x}^s$. The forward cycle-consistency $x^s \rightarrow \hat{x} \rightarrow \tilde{x}^s$ and the backward cycle-consistency $x \rightarrow \hat{x}^s \rightarrow \tilde{x}$ are enforced to keep the information consistent between $V_o$ and $V_s$. This cycle-consistent processing guarantees that the original video can be reconstructed from the summary video and vice versa, meaning the summary video can tell the same story as the original.
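To make the data flow concrete, below is a minimal PyTorch-style sketch of the components and one cycle pass. This is not the authors' released code: it assumes the summary features are score-weighted frame features ($x^s = s \odot x$), and all layer and hidden sizes except the 1024-d input features are illustrative placeholders.

import torch
import torch.nn as nn

FEAT_DIM = 1024  # GoogLeNet pool5 features, as in the paper; other sizes are illustrative

class Selector(nn.Module):
    """Bi-LSTM that predicts a [0,1] importance score per frame."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2,
                            bidirectional=True, batch_first=True)
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, x):                      # x: (B, T, FEAT_DIM)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.head(h))     # scores: (B, T, 1)

class Generator(nn.Module):
    """VAE-style LSTM generator mapping a feature sequence to a feature sequence."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=256, latent=128):
        super().__init__()
        self.enc = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.to_mu = nn.Linear(hidden, latent)
        self.to_logvar = nn.Linear(hidden, latent)
        self.dec = nn.LSTM(latent, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, feat_dim)

    def forward(self, x):
        h, _ = self.enc(x)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        e = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
        d, _ = self.dec(e)
        return self.out(d), mu, logvar

class Discriminator(nn.Module):
    """LSTM critic; returns a scalar score and its hidden features (no sigmoid, WGAN-style)."""
    def __init__(self, feat_dim=FEAT_DIM, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):
        h, _ = self.lstm(x)
        return self.head(h[:, -1]), h          # score, hidden features

# One forward pass of the cycle (shapes only; the losses appear in the next subsections).
selector, G_f, G_b = Selector(), Generator(), Generator()
D_f, D_b = Discriminator(), Discriminator()
x = torch.randn(1, 120, FEAT_DIM)              # features of a 120-frame video
s = selector(x)
x_s = s * x                                    # assumed score-weighted summary features
x_hat, _, _ = G_f(x_s)                         # summary -> reconstructed original
x_s_hat, _, _ = G_b(x)                         # original -> reconstructed summary
x_s_cyc, _, _ = G_b(x_hat)                     # forward cycle: x^s -> x_hat -> ~x^s
x_cyc, _, _ = G_f(x_s_hat)                     # backward cycle: x -> x_hat^s -> ~x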

Training Loss

We design the following loss functions to train the Cycle-SUM model: the sparsity loss $\mathcal{L}_{sparsity}$ controls the summary length through the selector; the prior loss $\mathcal{L}_{prior}$ and the reconstruction loss $\mathcal{L}_{recons}$ train the two VAE-based generators; the adversarial losses $\mathcal{L}_{adv}^{f}$ and $\mathcal{L}_{adv}^{b}$ are derived from the forward and backward GANs; and $\mathcal{L}_{cyc}$ is the cycle-consistent loss.

Sparsity loss

This loss penalizes the number of frames selected from the original video to form the summary. A higher sparsity ratio gives a shorter summary video. Formally, it is defined as

$\mathcal{L}_{sparsity} = \Big\|\,\frac{1}{n}\sum_{t=1}^{n} s_t \;-\; \sigma\,\Big\|,$

where $n$ is the total number of video frames and $\sigma$ is a pre-defined percentage of frames to select for the summary. The ground-truth ratio in the standard benchmarks is 15% [Gygli et al.2014, Song et al.2015], but we empirically set $\sigma$ to 30% for the selector during training [Mahasseni, Lam, and Todorovic2017].
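A one-function sketch of this regularizer under the notation above; since the exact norm is not recoverable from this copy, the sketch uses the absolute deviation of the mean score from the target ratio $\sigma$.

import torch

def sparsity_loss(scores: torch.Tensor, sigma: float = 0.3) -> torch.Tensor:
    """Penalize deviation of the mean importance score from the target ratio sigma.
    scores: (B, T, 1) selector outputs in [0, 1]."""
    return (scores.mean(dim=(1, 2)) - sigma).abs().mean()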

Generative loss

We adopt a VAE as each generator for reconstruction, so the generative loss contains the prior loss $\mathcal{L}_{prior}$ and the reconstruction loss $\mathcal{L}_{recons}$. For the forward VAE (the forward generator $G_f$), the encoder encodes the input features into a latent variable $e$. Let $p(e)$ be the prior distribution of the latent variable; with the typical reparameterization trick, $p(e)$ is set to the standard Gaussian distribution $\mathcal{N}(0, I)$ [Kingma and Welling2013]. Define $q_{\phi}(e\,|\,x^s)$ as the posterior distribution and $p_{\theta}(x\,|\,e)$ as the conditional generative distribution, where $\phi$ denotes the parameters of the encoder and $\theta$ those of the decoder. The objective of the forward generator is

$\mathcal{L}_{VAE} = D_{KL}\big(q_{\phi}(e\,|\,x^s)\,\|\,p(e)\big) \;-\; \mathbb{E}_{q_{\phi}(e|x^s)}\big[\log p_{\theta}(x\,|\,e)\big],$

where the first term is the KL divergence serving as the prior loss: $\mathcal{L}_{prior} = D_{KL}\big(q_{\phi}(e\,|\,x^s)\,\|\,p(e)\big)$.

The second term is an element-wise metric measuring the similarity between samples, so we use it as the reconstruction loss $\mathcal{L}_{recons}$. The typical reconstruction loss for auto-encoder networks is the Euclidean distance between the input and the reconstructed output, $\|x - \hat{x}\|^2$. According to [Larsen et al.2015], element-wise metrics cannot model properties of human visual perception, so they propose to jointly train the VAE (the generator) and the GAN discriminator, where the discriminator's hidden representation is used to measure sample similarity. Cycle-SUM adopts the same structure to measure video distance and achieves a feature-wise reconstruction. Specifically, if $x$ and $\hat{x}$ denote the target and the output of the VAE (the generator), the outputs of the last hidden layer of the discriminator are $D_l(x)$ and $D_l(\hat{x})$. Then, consider $p\big(D_l(x)\,|\,e\big) = \mathcal{N}\big(D_l(x)\,\big|\,D_l(\hat{x}),\, I\big)$. The expectation can be computed by an empirical average, and the reconstruction loss can be re-written as

$\mathcal{L}_{recons} = -\,\mathbb{E}_{q_{\phi}(e|x^s)}\big[\log p\big(D_l(x)\,|\,e\big)\big].$

The backward generator has the same $\mathcal{L}_{prior}$ and $\mathcal{L}_{recons}$ as the forward one; the only difference is that its input and output are reversed.
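A sketch of the two generative-loss terms under the assumptions above: the closed-form Gaussian KL for $\mathcal{L}_{prior}$ and a feature-matching loss on the discriminator's hidden states for $\mathcal{L}_{recons}$ (the Discriminator here is the sketch from the Architecture section, not the authors' implementation).

import torch
import torch.nn.functional as F

def prior_loss(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), closed form, averaged over the batch."""
    return (-0.5 * (1 + logvar - mu.pow(2) - logvar.exp())).sum(dim=-1).mean()

def recon_loss(disc, x_target: torch.Tensor, x_recon: torch.Tensor) -> torch.Tensor:
    """Feature-wise reconstruction loss: match the discriminator's hidden
    representations of the target and reconstructed sequences (Larsen et al., 2015)."""
    _, h_target = disc(x_target)
    _, h_recon = disc(x_recon)
    return F.mse_loss(h_recon, h_target)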

Adversarial loss

The learning objective of the evaluator is to maximize the mutual information between the original and summary videos. According to Eqn. (4), the adversarial losses for the forward GAN ($G_f$ and $D_f$) and the backward GAN ($G_b$ and $D_b$) are

$\mathcal{L}_{adv}^{f} = \mathbb{E}_{x\sim p(x)}\big[D_f(x)\big] - \mathbb{E}_{x^s}\Big[f^*\big(D_f(G_f(x^s))\big)\Big], \qquad \mathcal{L}_{adv}^{b} = \mathbb{E}_{x^s}\big[D_b(x^s)\big] - \mathbb{E}_{x}\Big[f^*\big(D_b(G_b(x))\big)\Big].$

To avoid mode collapse and improve the stability of optimization, we use the loss suggested in Wasserstein GAN [Arjovsky, Chintala, and Bottou2017]:

$\mathcal{L}_{adv}^{f} = \mathbb{E}_{x\sim p(x)}\big[D_f(x)\big] - \mathbb{E}_{x^s}\big[D_f(G_f(x^s))\big], \qquad \mathcal{L}_{adv}^{b} = \mathbb{E}_{x^s}\big[D_b(x^s)\big] - \mathbb{E}_{x}\big[D_b(G_b(x))\big].$
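A sketch of the WGAN-style critic and generator losses with weight clipping, reusing the Discriminator sketch above; the clipping range 0.01 is the WGAN default and only illustrative, since the paper's value is not given in this copy.

def critic_loss(disc, x_real, x_fake):
    """WGAN critic loss: maximize D(real) - D(fake), i.e. minimize the negative."""
    score_real, _ = disc(x_real)
    score_fake, _ = disc(x_fake.detach())
    return -(score_real.mean() - score_fake.mean())

def generator_adv_loss(disc, x_fake):
    """Generator/selector side: make generated sequences score high under the critic."""
    score_fake, _ = disc(x_fake)
    return -score_fake.mean()

def clip_weights(disc, c: float = 0.01):
    """Weight clipping as in WGAN; the clipping range used by the paper is not specified here."""
    for p in disc.parameters():
        p.data.clamp_(-c, c)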

Cycle-Consistent Loss

Since we expect the summary frame features to contain all the information of the original frame features, the original video should be fully reconstructible from them. Thus, when we convert from the original to the summary video and then back again, we should obtain a video similar to the original one; in this way we can safely guarantee the completeness of the summary video. This processing is more advantageous than the one-direction reconstruction in existing image reconstruction works [Zhu et al.2017, Yi et al.2017]. Based on this intuition, we introduce the cycle-consistent loss below. The forward cycle proceeds as $x^s \rightarrow \hat{x} \rightarrow \tilde{x}^s$, and its loss is correspondingly defined as $\mathcal{L}_{cyc}^{f} = \big\| x^s - G_b\big(G_f(x^s)\big) \big\|$. For the backward cycle, the procedure is $x \rightarrow \hat{x}^s \rightarrow \tilde{x}$, so the loss is $\mathcal{L}_{cyc}^{b} = \big\| x - G_f\big(G_b(x)\big) \big\|$. We measure $\mathcal{L}_{cyc}$ with an element-wise distance between features; the choice of distance matters, since an unsuitable element-wise metric often leads to blurriness [Larsen et al.2015, He et al.2016].
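A sketch of the two cycle terms under the notation above; the distance is written as MSE here only because the specific norm is not recoverable from this copy.

import torch.nn.functional as F

def cycle_loss(x, x_cyc, x_s, x_s_cyc):
    """Forward cycle: x^s -> G_f -> G_b should return to x^s;
    backward cycle: x -> G_b -> G_f should return to x."""
    return F.mse_loss(x_s_cyc, x_s) + F.mse_loss(x_cyc, x)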

Overall Loss

The overall loss $\mathcal{L}$ is the objective for training the Cycle-SUM model:

$\mathcal{L} = \mathcal{L}_{sparsity} + \lambda_1\big(\mathcal{L}_{adv}^{f} + \mathcal{L}_{adv}^{b}\big) + \lambda_2\big(\mathcal{L}_{prior} + \mathcal{L}_{recons}\big) + \lambda_3\big(\mathcal{L}_{cyc}^{f} + \mathcal{L}_{cyc}^{b}\big),$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters balancing the adversarial, generative and cycle-consistent terms.
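A sketch of combining the terms; the weight values below are placeholders, not the paper's settings.

# Illustrative weights; the paper's lambda values are not given in this copy.
lam_adv, lam_gen, lam_cyc = 1.0, 1.0, 1.0

def overall_loss(l_sparsity, l_adv_f, l_adv_b, l_prior, l_recons, l_cyc):
    return (l_sparsity
            + lam_adv * (l_adv_f + l_adv_b)
            + lam_gen * (l_prior + l_recons)
            + lam_cyc * l_cyc)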

Training Cycle-SUM

Given the above training losses and the final objective, we adopt Stochastic Gradient Variational Bayes estimation [Kingma and Welling2014] to update the parameters during training. The selector and the generators in Cycle-SUM are trained jointly to maximally confuse the discriminators. To stabilize the training process, we initialize all parameters with Xavier initialization [Glorot and Bengio2010] and clip the discriminator parameters [Arjovsky, Chintala, and Bottou2017]. The clipping parameter falls in a small symmetric interval around zero, and the generators are updated several times per discriminator iteration; the exact values are hyper-parameters. Algorithm 1 summarizes the steps for training Cycle-SUM.

0:  Frame-level features $x$ of the training video
0:  Learned parameters for the selector, the two generators and discriminators: $\theta_{S}$, $\theta_{G_f}$, $\theta_{G_b}$, $\theta_{D_f}$, $\theta_{D_b}$
1:  Initialize all parameters with the Xavier approach
2:  repeat
3:   for i = 1, …, k do
4:     $x \leftarrow$ original frame-level features from the CNN
5:     $x^s \leftarrow$ selector($x$) % selected frame-level features
6:     $\hat{x} \leftarrow G_f(x^s)$ % reconstruction by the forward generator
7:     $\hat{x}^s \leftarrow G_b(x)$ % reconstruction by the backward generator
8:     $\tilde{x}^s \leftarrow G_b(\hat{x})$ % forward cycle
9:     $\tilde{x} \leftarrow G_f(\hat{x}^s)$ % backward cycle
10:    compute $\mathcal{L}_{sparsity}$, $\mathcal{L}_{prior}$, $\mathcal{L}_{recons}$, $\mathcal{L}_{adv}^{f}$, $\mathcal{L}_{adv}^{b}$, $\mathcal{L}_{cyc}^{f}$ and $\mathcal{L}_{cyc}^{b}$
    % Updates using RMSProp
11:    update $\theta_{S}$ and $\theta_{G_f}$ by descending the overall loss $\mathcal{L}$
12:    update $\theta_{G_b}$ by descending the overall loss $\mathcal{L}$
13:   end for
14:   update $\theta_{D_f}$ by ascending $\mathcal{L}_{adv}^{f}$ %maximization update
15:   clip $\theta_{D_f}$
16:   update $\theta_{D_b}$ by ascending $\mathcal{L}_{adv}^{b}$ %maximization update
17:   clip $\theta_{D_b}$
18:  until convergence
Algorithm 1 Training Cycle-SUM model
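To tie the pieces together, the following is an illustrative training step mirroring the structure of Algorithm 1. It reuses the sketches above (selector, G_f, G_b, D_f, D_b and the loss functions) and is not the authors' implementation; the optimizer settings and the iteration count k are assumptions.

import torch

# RMSProp follows Algorithm 1; learning rates are illustrative.
opt_gen = torch.optim.RMSprop(
    list(selector.parameters()) + list(G_f.parameters()) + list(G_b.parameters()), lr=1e-4)
opt_disc = torch.optim.RMSprop(
    list(D_f.parameters()) + list(D_b.parameters()), lr=1e-4)

def train_step(x, k: int = 5):
    """One outer iteration: k selector/generator updates, then one critic update."""
    for _ in range(k):                         # generator iterations per critic iteration
        s = selector(x)
        x_s = s * x
        (x_hat, mu_f, lv_f), (x_s_hat, mu_b, lv_b) = G_f(x_s), G_b(x)
        x_s_cyc, _, _ = G_b(x_hat)             # forward cycle
        x_cyc, _, _ = G_f(x_s_hat)             # backward cycle
        loss = overall_loss(
            sparsity_loss(s),
            generator_adv_loss(D_f, x_hat), generator_adv_loss(D_b, x_s_hat),
            prior_loss(mu_f, lv_f) + prior_loss(mu_b, lv_b),
            recon_loss(D_f, x, x_hat) + recon_loss(D_b, x_s, x_s_hat),
            cycle_loss(x, x_cyc, x_s, x_s_cyc))
        opt_gen.zero_grad(); loss.backward(); opt_gen.step()
    # Critic (maximization) updates with WGAN weight clipping.
    s = selector(x).detach()
    x_s = s * x
    x_hat, _, _ = G_f(x_s)
    x_s_hat, _, _ = G_b(x)
    d_loss = critic_loss(D_f, x, x_hat) + critic_loss(D_b, x_s, x_s_hat)
    opt_disc.zero_grad(); d_loss.backward(); opt_disc.step()
    clip_weights(D_f); clip_weights(D_b)

Calling train_step(x) repeatedly on each video's feature tensor x of shape (1, T, 1024) corresponds to one pass of the repeat loop in Algorithm 1.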

Experiment

Experiment Setup

Datasets and Protocol

We evaluate Cycle-SUM on two benchmark datasets: SumMe [Gygli et al.2014] and TVSum [Song et al.2015]. SumMe contains 25 videos ranging from 1 to 7 minutes, annotated with frame-level binary importance scores. TVSum contains 50 videos downloaded from YouTube, with shot-level importance scores taking integer values from 1 to 5.

Following the convention [Gygli, Grabner, and Van Gool2015, Zhang et al.2016, Mahasseni, Lam, and Todorovic2017], we adopt the F-measure as the performance metric: given the ground-truth and the produced summary video, we compute the F-score as the harmonic mean of precision and recall.

For the TVSum dataset, shot-level ground truths are provided, while Cycle-SUM outputs frame-level scores at test time. We therefore follow the method in [Zhang et al.2016] to convert the frame-level evaluation into a shot-level evaluation.
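For reference, the metric itself is a short computation; in the sketch below the inputs are the overlap length and the predicted and ground-truth summary lengths (in frames or shot durations), which is an assumption about bookkeeping rather than the paper's exact evaluation script.

def f_score(n_overlap: int, n_pred: int, n_true: int) -> float:
    """Harmonic mean of precision and recall over summary durations."""
    precision = n_overlap / n_pred if n_pred else 0.0
    recall = n_overlap / n_true if n_true else 0.0
    return 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0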

Implementation Details

For fairness, the frame features used to train our model are the same as in [Zhang et al.2016, Mahasseni, Lam, and Todorovic2017]: we extract 1024-d frame features from the output of the pool5 layer of GoogLeNet [Szegedy et al.2015] pre-trained on ImageNet [Deng et al.2009].

Each of the two generators in Cycle-SUM is a VAE-based LSTM network consisting of a two-layer encoder and a two-layer decoder. A decoder LSTM that reconstructs the sequence in reverse order is easier to train [Srivastava, Mansimov, and Salakhudinov2015], so the decoders in Cycle-SUM also reconstruct the frame features in reverse order. The discriminators are LSTM networks followed by a fully-connected layer that outputs a real/fake prediction for the input. Following the WGAN architecture [Arjovsky, Chintala, and Bottou2017], we remove the sigmoid function in the last layer of the discriminator to make the model easier to train. The selector is a three-layer Bi-LSTM network.

The two VAE generators are initialized by pre-training on the frame features of the original videos. Similar to [Mahasseni, Lam, and Todorovic2017], this initialization strategy accelerates training and improves overall accuracy.

All experiments are conducted five times on five random splits, and we report the average performance.

Quantitative Results

Method  SumMe  TVSum
De Avila et al. [De Avila et al.2011]  33.7  -
Li et al. [Li and Merialdo2010]  26.6  -
Khosla et al. [Khosla et al.2013]  -  50
Song et al. [Song et al.2015]  26.6  50
SUM-GAN [Mahasseni, Lam, and Todorovic2017]  39.1  51.7
Cycle-SUM  41.9  57.6
Table 1: Comparison of F-scores of Cycle-SUM with other unsupervised learning approaches on SumMe and TVSum.

We compare our Cycle-SUM model with several state-of-the-art unsupervised methods in Tab. 1. Cycle-SUM outperforms all of them. In particular, it surpasses the strongest baseline, SUM-GAN [Mahasseni, Lam, and Todorovic2017], by 2.8% on SumMe and 5.9% on TVSum, clearly demonstrating the effectiveness of the proposed cycle-consistent loss. These results confirm the superior performance of Cycle-SUM for video summarization.

Figure 3: Comparison of frames selected w.r.t. importance score by Cycle-SUM and other state-of-the-art methods (vsLSTM and SUM-GAN). Dark blue bars show the ground-truth frame-level annotation; red bars mark the shots selected by each model. The example video (#15) is from TVSum.

Ablation Analysis of Cycle-SUM

Method  SumMe  TVSum
Cycle-SUM-C  34.8  49.5
Cycle-SUM-1G  38.2  51.4
Cycle-SUM-2G  39.7  53.6
Cycle-SUM-Gf  40.3  55.2
Cycle-SUM-Gb  39.9  55.0
Cycle-SUM  41.9  57.6
Table 2: Performance comparison (F-scores) of the vanilla Cycle-SUM model and its ablation variants on SumMe and TVSum.

We further conduct an ablation analysis to study the effects of different components of the Cycle-SUM model, including the generative adversarial learning and the cycle-consistent learning. In particular, we consider the following ablation variants of Cycle-SUM.

  • Cycle-SUM-C. This variant verifies the effect of adversarial learning. It drops the adversarial losses $\mathcal{L}_{adv}^{f}$ and $\mathcal{L}_{adv}^{b}$ and keeps all other losses, in particular the cycle-consistent loss, in the overall objective.

  • Cycle-SUM-2G. The cycle-consistent losses $\mathcal{L}_{cyc}^{f}$ and $\mathcal{L}_{cyc}^{b}$ are removed from the overall loss, while the forward GAN and backward GAN are kept. We compare Cycle-SUM-2G with Cycle-SUM to analyze the contribution of the cycle-consistent reconstruction.

  • Cycle-SUM-1G. The cycle-consistent losses $\mathcal{L}_{cyc}^{f}$ and $\mathcal{L}_{cyc}^{b}$ are not used when training this variant. Meanwhile, we remove the backward generator $G_b$ and discriminator $D_b$, so there is no backward reconstruction from $x$ to $\hat{x}^s$. This model is similar to SUM-GAN [Mahasseni, Lam, and Todorovic2017]: it has only the forward GAN, whereas Cycle-SUM-2G keeps both GANs during training. We are thus also interested in comparing Cycle-SUM-2G and Cycle-SUM-1G.

  • Cycle-SUM-Gf. The backward cycle-consistent loss $\mathcal{L}_{cyc}^{b}$ is not included in the overall objective during training, while the forward cycle-consistent loss $\mathcal{L}_{cyc}^{f}$ is kept. Both the forward and backward adversarial learning are kept.

  • Cycle-SUM-Gb. The forward cycle-consistent loss $\mathcal{L}_{cyc}^{f}$ is not included in the overall objective during training, while the backward one is kept. Both the forward and backward adversarial learning are kept.

Comparing the F-scores of Cycle-SUM and Cycle-SUM-C in Tab. 2, we see that adversarial learning improves performance significantly, confirming the benefit of deploying GANs in unsupervised video summarization. Compared with the single-GAN variant Cycle-SUM-1G, Cycle-SUM-2G gains about 2% on average. Both comparisons verify that adversarial learning helps improve video summarization.

Comparing Cycle-SUM-2G and Cycle-SUM, the F-scores rise by 2.2% on SumMe and 4.0% on TVSum. This shows that the cycle-consistent reconstructions help the summary video retain the full information of the original video.

The results of Cycle-SUM-Gf are slightly better than those of Cycle-SUM-Gb. Both variants, however, bring performance gains over Cycle-SUM-2G, which again shows that the forward and backward cycle-consistent processing improve the ability to select a complete summary from the original video.

To sum up, the adversarial learning keeps the summary and original videos at a suitable distance in the deep feature space, and the cycle-consistent learning ensures the selected frames retain the full information of the original video.

Qualitative Results

Fig. 3 shows summarization results on a sample video from TVSum. We compare the frames selected by Cycle-SUM with those of two other recent state-of-the-art models, vsLSTM [Zhang et al.2016] and SUM-GAN [Mahasseni, Lam, and Todorovic2017], on an example where all three models succeed. As shown in Fig. 3, Cycle-SUM selects shorter but more key shots than the other two models. Compared with the results of vsLSTM and SUM-GAN, topic-specific and informative details, e.g. the frames showing the doctor pushing medicinal liquid into the dog's ear, are correctly selected by Cycle-SUM.

Conclusion

In this paper, we theoretically reveal how to effectively maximize mutual information through cycle-consistent adversarial learning. Based on this analysis, we propose a new Cycle-SUM model for frame-level video summarization. Experimental results show that the cycle-consistent mechanism can significantly improve video summarization, and that Cycle-SUM produces more precise summary videos than strong baselines, which validates the effectiveness of our method.

Acknowledgments

This work was supported in part, to Jiashi Feng, by NUS IDS R-263-000-C67-646, ECRA R-263-000-C87-133 and MOE Tier-II R-263-000-D17-112; in part, to Ping Li, by NSFC Grants 61872122 and 61502131; and in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LY18F020015.

References

  • [Arjovsky, Chintala, and Bottou2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein gan. arXiv preprint arXiv:1701.07875.
  • [Cong, Yuan, and Luo2012] Cong, Y.; Yuan, J.; and Luo, J. 2012. Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Transactions on Multimedia 14(1):66–75.
  • [De Avila et al.2011] De Avila, S. E. F.; Lopes, A. P. B.; da Luz Jr, A.; and de Albuquerque Araújo, A. 2011. Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters 32(1):56–68.
  • [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.
  • [Ejaz, Mehmood, and Baik2013] Ejaz, N.; Mehmood, I.; and Baik, S. W. 2013. Efficient visual attention based framework for extracting key frames from videos. Signal Processing: Image Communication 28(1):34–44.
  • [Glorot and Bengio2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 249–256.
  • [Gong et al.2014] Gong, B.; Chao, W.-L.; Grauman, K.; and Sha, F. 2014. Diverse sequential subset selection for supervised video summarization. In Advances in Neural Information Processing Systems, 2069–2077.
  • [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in neural information processing systems, 2672–2680.
  • [Gygli et al.2014] Gygli, M.; Grabner, H.; Riemenschneider, H.; and Van Gool, L. 2014. Creating summaries from user videos. In ECCV, 505–520. Springer.
  • [Gygli, Grabner, and Van Gool2015] Gygli, M.; Grabner, H.; and Van Gool, L. 2015. Video summarization by learning submodular mixtures of objectives. In CVPR, 3090–3098.
  • [He et al.2016] He, D.; Xia, Y.; Qin, T.; Wang, L.; Yu, N.; Liu, T.; and Ma, W.-Y. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, 820–828.
  • [Khosla et al.2013] Khosla, A.; Hamid, R.; Lin, C.-J.; and Sundaresan, N. 2013. Large-scale video summarization using web-image priors. In CVPR, 2698–2705.
  • [Kim, Sigal, and Xing2014] Kim, G.; Sigal, L.; and Xing, E. P. X. 2014. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In CVPR. IEEE.
  • [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • [Kingma and Welling2014] Kingma, D. P., and Welling, M. 2014. Stochastic gradient vb and the variational auto-encoder. In ICLR.
  • [Kuanar, Panda, and Chowdhury2013] Kuanar, S. K.; Panda, R.; and Chowdhury, A. S. 2013. Video key frame extraction through dynamic delaunay clustering with a structural constraint. Journal of Visual Communication and Image Representation 24(7):1212–1227.
  • [Larsen et al.2015] Larsen, A. B. L.; Sønderby, S. K.; Larochelle, H.; and Winther, O. 2015. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
  • [Li and Merialdo2010] Li, Y., and Merialdo, B. 2010. Multi-video summarization based on video-mmr. In International Workshop on Image Analysis for Multimedia Interactive Services, 1–4. IEEE.
  • [Lu and Grauman2013] Lu, Z., and Grauman, K. 2013. Story-driven summarization for egocentric video. In CVPR, 2714–2721. IEEE.
  • [Ma et al.2002] Ma, Y.-F.; Lu, L.; Zhang, H.-J.; and Li, M. 2002. A user attention model for video summarization. In ACM Multimedia, 533–542. ACM.
  • [Mahasseni, Lam, and Todorovic2017] Mahasseni, B.; Lam, M.; and Todorovic, S. 2017. Unsupervised video summarization with adversarial lstm networks. In CVPR.
  • [Mei et al.2015] Mei, S.; Guan, G.; Wang, Z.; Wan, S.; He, M.; and Feng, D. D. 2015. Video summarization via minimum sparse reconstruction. Pattern Recognition 48(2):522–533.
  • [Meng et al.2016] Meng, J.; Wang, H.; Yuan, J.; and Tan, Y.-P. 2016. From keyframes to key objects: Video summarization by representative object proposal selection. In CVPR, 1039–1048.
  • [Potapov et al.2014] Potapov, D.; Douze, M.; Harchaoui, Z.; and Schmid, C. 2014. Category-specific video summarization. In ECCV, 540–555. Springer.
  • [Pritch et al.2007] Pritch, Y.; Rav-Acha, A.; Gutman, A.; and Peleg, S. 2007. Webcam synopsis: Peeking around the world. In ICCV, 1–8. IEEE.
  • [Song et al.2015] Song, Y.; Vallmitjana, J.; Stent, A.; and Jaimes, A. 2015. Tvsum: Summarizing web videos using titles. In CVPR, 5179–5187.
  • [Srivastava, Mansimov, and Salakhudinov2015] Srivastava, N.; Mansimov, E.; and Salakhudinov, R. 2015. Unsupervised learning of video representations using lstms. In ICML, 843–852.
  • [Szegedy et al.2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.; et al. 2015. Going deeper with convolutions. In CVPR.
  • [Wang et al.2012] Wang, M.; Hong, R.; Li, G.; Zha, Z.-J.; Yan, S.; and Chua, T.-S. 2012. Event driven web video summarization by tag localization and key-shot identification. IEEE Transactions on Multimedia 14(4):975–985.
  • [Yi et al.2017] Yi, Z.; Zhang, H.; Tan, P.; and Gong, M. 2017. Dualgan: Unsupervised dual learning for image-to-image translation. arXiv preprint.
  • [Zhang et al.2016] Zhang, K.; Chao, W.-L.; Sha, F.; and Grauman, K. 2016. Video summarization with long short-term memory. In ECCV, 766–782. Springer.
  • [Zhu et al.2017] Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593.