Introduction
With the explosion of video data, video summarization technologies [Ma et al.2002, Pritch et al.2007, Lu and Grauman2013] have become increasingly attractive for efficiently browsing, managing and retrieving video content. With such techniques, a long video can be shortened into different forms, e.g. key shots [Gygli et al.2014], key frames [Kim, Sigal, and Xing2014] and key objects [Meng et al.2016]. Here, we aim at selecting key frames for summarizing a video.
Video summarization is usually formulated as a structure prediction problem [Zhang et al.2016, Mahasseni, Lam, and Todorovic2017]. The model takes as input a sequence of video frames, and outputs a subset of original video frames containing critical information. Ideally, the summary video should keep all key information of the original video with minimal redundancy. Summary completeness and compactness are expected for good video summarization.
Existing approaches can be roughly grouped into supervised and unsupervised ones. Many supervised approaches [Zhang et al.2016, Gygli, Grabner, and Van Gool2015] utilize human-annotated summaries as ground truth to train a model. However, sufficient human-annotated video summarization examples are not always available and are expensive to collect. Thus, unsupervised approaches that do not require human intervention become increasingly attractive due to their low cost. For these methods, it is critical to design a proper summary quality metric. For instance, [Mahasseni, Lam, and Todorovic2017] adopt the GAN [Goodfellow et al.2014] to measure the similarity between the summary and the original video and improve the summarization model by optimizing the induced objective, based on the basic idea that a good summary video should be able to faithfully reconstruct the original input video. However, this approach only considers one-direction reconstruction, so some significant frames may dominate the quality measure, leading to severe information loss in the summary video.
In this paper, we propose a novel cycle-consistent unsupervised model, motivated by maximizing the mutual information between the summary video and the original video. Our model is developed with a new cycle-consistent adversarial learning objective to pursue optimal information preserving for the summary video, partially inspired by the cycle generative adversarial network [Zhu et al.2017, Yi et al.2017]. Moreover, to effectively capture the short-range and long-range dependencies among sequential frames [Zhang et al.2016], we propose a VAE-based LSTM network as the backbone model for learning video representations. We name this cycle-consistent adversarial LSTM network for video summarization CycleSUM.
CycleSUM performs original and summary video reconstruction in a cycle manner, and leverages the consistency between the original/summary video and its cycle reconstruction result to “supervise” the video summarization. Such a cycle-consistent objective guarantees summary completeness without additional supervision. Compared with one-direction reconstruction (i.e., from summary video to original video) [Zhu et al.2017, Yi et al.2017], the bi-direction model performs a reversed reconstruction and a cycle-consistent reconstruction to relieve information loss.
Structurally, the CycleSUM model consists of two components: a selector that predicts an importance score for each frame and selects the frames with high importance scores to form the summary video, and a cycle-consistent evaluator that evaluates the quality of the selected frames through cycle reconstruction. To achieve effective information preserving, the evaluator employs two VAE-based generators and two discriminators to evaluate the cycle-consistent loss. The forward generator and discriminator are responsible for reconstructing the original video from the summary video, and the backward counterparts perform the backward reconstruction from the original to the summary video. Both reconstructions are performed in the learned embedding feature space. Each discriminator is trained to distinguish the summary video from the original. If the summary video misses some informative frames, the discriminator can tell its difference from the original and thus serves as a good evaluator to encourage the selector to pick important frames.
An illustration of the proposed framework is given in Fig. 1. The summary video is a subset of all training video frames, selected by the selector based on the predicted frame-wise importance scores. The original video is reconstructed from the summary video, and then back again. Given a distance between the original video and the summary video in the deep feature space, the CycleSUM model optimizes the selector such that this distance is minimized over training examples. The closed loop of CycleSUM aims to 1) assist the BiLSTM selector in selecting a subset of frames from the original video, and 2) keep a suitable distance between the summary video and the original video in the deep feature space to improve summary completeness and reduce redundancy.
Our contributions are threefold. 1) We introduce a new unsupervised video summarization model that does not require any manual annotation of video frame importance yet achieves outstanding performance. 2) We propose a novel cycle-consistent adversarial learning model. Compared with one-direction reconstruction based models, our model is superior in preserving information and facilitating the learning procedure. 3) We theoretically derive the relation between maximizing the mutual information of summary and original video and the proposed cycle-consistent adversarial learning model. To the best of our knowledge, this work is the first to transparently reveal how to effectively maximize mutual information by cycle adversarial learning.
Related Work
Supervised video summarization approaches leverage videos with human annotations of frame importance to train models. For example, Gong et al. formulate video summarization as a supervised subset selection problem and propose a sequential determinantal point process (seqDPP) based model to sample a representative and diverse subset from training data [Gong et al.2014]. To relieve the human annotation burden and reduce cost, unsupervised approaches, which have received increasing attention, generally design different criteria to rank frames by importance for selection. For example, [Wang et al.2012, Potapov et al.2014] propose to select frames according to their content relevance. [Mei et al.2015, Cong, Yuan, and Luo2012] design unsupervised criteria by trying to reconstruct the original video from selected key frames and key shots under the dictionary learning framework. Clustering-based models [De Avila et al.2011, Kuanar, Panda, and Chowdhury2013] and attention-based models [Ma et al.2002, Ejaz, Mehmood, and Baik2013] have also been developed to select key frames.
Recently, deep learning models have been developed for both supervised and unsupervised video summarization, in which an LSTM is usually taken as the video representation model. For example, [Zhang et al.2016] treat video summarization as a sequential prediction problem inspired by speech recognition. They present a bidirectional LSTM architecture to learn the representation of sequential frames of variable length and output a binary vector indicating which frames to select. Our proposed CycleSUM also adopts LSTM as the backbone for learning long-range dependencies between video frames.
Method
The proposed CycleSUM model formulates video summarization as a sequence-to-sequence learning problem, taking as input a sequence of video frames and outputting a sequence of frame-wise importance scores. The frames with high importance scores are selected to form a summary video. Throughout the paper, we use $V$ and $S$ to denote the original input video and the summary video respectively, and $v=\{v_t\}$ and $s=\{s_t\}$ to denote the frame-level features of $V$ and $S$ respectively. To train CycleSUM in an unsupervised manner, we develop a cycle-consistent learning method for maximizing the mutual information between $V$ and $S$.
Mutual Information Maximization via Cycle Learning
Video summarization is essentially aimed at extracting video frames that contain the critical information of the original video. In this subsection, we explain how to derive our cycle-consistent learning objective from the desired objective of maximizing the mutual information between the summary video $S$ and the original video $V$.
Formally, the mutual information is defined as
$$I(V;S)=\mathrm{KL}\big(p(V,S)\,\|\,p(V)p(S)\big),$$
where $\mathrm{KL}(\cdot\,\|\,\cdot)$ is the KL-divergence between two distributions. The objective of video summarization is then to extract the summary video $S$ from $V$ to maximize their mutual information: the summarization model should produce $S$ such that its conditional distribution $p(S|V)$ gives the maximal mutual information with $V$. However, though it is easy to obtain an empirical estimate of the original video distribution $p(V)$, it is difficult to obtain the ground-truth distribution $p(S)$ of the corresponding summary in an unsupervised learning scenario. This is one major challenge of unsupervised video summarization.
We propose a cycle-consistent learning objective to relieve this learning difficulty. We notice that
$$I(V;S)=\mathbb{E}_{p(S)}\big[\mathrm{KL}\big(p(V|S)\,\|\,p(V)\big)\big]=\mathbb{E}_{p(V)}\big[\mathrm{KL}\big(p(S|V)\,\|\,p(S)\big)\big].\tag{1}$$
The above mutual information computation “anchors” at $p(V)$, which can be faithfully estimated, and thus eases the procedure of learning the distribution of $S$ even in an unsupervised learning setting.
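As a concrete numerical illustration of the definition above, the following sketch computes the mutual information as the KL-divergence between a toy discrete joint distribution and the product of its marginals; the distribution values are illustrative, not taken from any dataset:

```python
import numpy as np

# Toy check of I(V;S) = KL(p(V,S) || p(V)p(S)) on a 2x2 discrete example.
p_joint = np.array([[0.4, 0.1],
                    [0.1, 0.4]])              # p(V, S)
p_v = p_joint.sum(axis=1, keepdims=True)      # marginal p(V), shape (2, 1)
p_s = p_joint.sum(axis=0, keepdims=True)      # marginal p(S), shape (1, 2)
p_prod = p_v * p_s                            # product of marginals p(V)p(S)

# KL divergence between the joint and the product of marginals
mutual_info = float(np.sum(p_joint * np.log(p_joint / p_prod)))
print(round(mutual_info, 4))
```

The result is non-negative, and equals zero exactly when the joint factorizes, i.e. when $V$ and $S$ are independent.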
To effectively model and optimize the above learning objective, we adopt the Fenchel conjugate to derive a bound that is easier to optimize. The Fenchel conjugate of a function $f$ is defined as $f^*(t)=\sup_{u}\{ut-f(u)\}$, or equivalently $f(u)=\sup_{t}\{tu-f^*(t)\}$. Thus, taking $f(u)=u\log u$, whose conjugate is $f^*(t)=\exp(t-1)$, we have the following variational lower bound for the KL-divergence between distributions $P$ and $Q$:
$$\mathrm{KL}(P\,\|\,Q)\ \geq\ \sup_{T\in\mathcal{T}}\Big(\mathbb{E}_{x\sim P}\big[T(x)\big]-\mathbb{E}_{x\sim Q}\big[\exp\big(T(x)-1\big)\big]\Big),$$
where $\mathcal{T}$ is an arbitrary class of functions $T:\mathcal{X}\to\mathbb{R}$. The above inequality follows from Jensen's inequality and the fact that $\mathcal{T}$ is only a subset of all possible functions. Therefore, we have
$$\mathbb{E}_{p(S)}\big[\mathrm{KL}\big(p(V|S)\,\|\,p(V)\big)\big]\ \geq\ \sup_{T\in\mathcal{T}}\,\mathbb{E}_{S\in\mathcal{S}}\Big(\mathbb{E}_{\hat{V}\sim p(V|S)}\big[T(\hat{V})\big]-\mathbb{E}_{V\sim p(V)}\big[\exp\big(T(V)-1\big)\big]\Big).$$
Here $\mathcal{S}$ is the set of produced summary videos. We can use a generative model to estimate $p(V|S)$. To this end, we follow the generative-adversarial approach [Goodfellow et al.2014] and use two neural networks, $G_f$ and $D_f$, to implement sampling and data transformation. Here $G_f$ is the forward generative model, taking the summary $S$ as the condition and outputting a sample $\hat{V}=G_f(S)$ that reconstructs the original video. $D_f$ is the forward discriminative model. We learn the generative model by finding a saddle point of the objective function $F(D_f,G_f)$, which we minimize w.r.t. $G_f$ and maximize w.r.t. $D_f$:
$$\min_{G_f}\max_{D_f}\ F(D_f,G_f)=\mathbb{E}_{S\sim p(S)}\big[D_f(G_f(S))\big]-\mathbb{E}_{V\sim p(V)}\big[\exp\big(D_f(V)-1\big)\big].\tag{2}$$
The above objective is similar to that of GANs, but the generative model is a conditioned one.
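The variational bound used in the derivation above can be checked numerically. The sketch below compares the true KL-divergence between two toy discrete distributions against the bound evaluated at the optimal witness $T^*(x)=1+\log(P(x)/Q(x))$, where the bound is tight, and at an arbitrary suboptimal witness; all numbers are illustrative:

```python
import numpy as np

# Numeric check of the variational (Fenchel) lower bound for the KL divergence:
#   KL(P || Q) >= E_P[T] - E_Q[exp(T - 1)]   for any witness function T,
# with equality at the optimal witness T*(x) = 1 + log(P(x)/Q(x)).
P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])

kl_true = float(np.sum(P * np.log(P / Q)))

T_opt = 1.0 + np.log(P / Q)                    # optimal witness: bound is tight
bound_opt = float(np.sum(P * T_opt) - np.sum(Q * np.exp(T_opt - 1.0)))

T_sub = np.zeros_like(P)                       # arbitrary suboptimal witness
bound_sub = float(np.sum(P * T_sub) - np.sum(Q * np.exp(T_sub - 1.0)))

print(round(kl_true, 4), round(bound_opt, 4), round(bound_sub, 4))
```

Restricting the witness to a parametric class (the discriminator) therefore yields an optimizable lower estimate of the divergence.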
Similarly, we can obtain the learning objective to optimize the backward KL-divergence $\mathbb{E}_{p(V)}\big[\mathrm{KL}\big(p(S|V)\,\|\,p(S)\big)\big]$ by solving
$$\min_{G_b}\max_{D_b}\ F(D_b,G_b)=\mathbb{E}_{V\sim p(V)}\big[D_b(G_b(V))\big]-\mathbb{E}_{S\sim p(S)}\big[\exp\big(D_b(S)-1\big)\big].\tag{3}$$
Substituting Eqn. (2) and Eqn. (3) into Eqn. (1) gives the following cycle learning objective to maximize the mutual information between the original and summary video:
$$\min_{G_f,G_b}\max_{D_f,D_b}\ \mathbb{E}_{S\sim p(S)}\big[D_f(G_f(S))\big]-\mathbb{E}_{V\sim p(V)}\big[\exp\big(D_f(V)-1\big)\big]+\mathbb{E}_{V\sim p(V)}\big[D_b(G_b(V))\big]-\mathbb{E}_{S\sim p(S)}\big[\exp\big(D_b(S)-1\big)\big],\tag{4}$$
where the constant scaling factor is omitted. To relieve the difficulty brought by the unknown distribution $p(S)$, we use the following cycle-consistent constraint to further regularize the generative models and the cycle learning process:
$$G_b(G_f(s))\approx s,\qquad G_f(G_b(v))\approx v.$$
We name cycle learning with the above consistency constraint cycle-consistent learning.
Architecture
Based on the above derivations, we design the cycle-consistent adversarial model for video summarization (CycleSUM). The architecture of our CycleSUM model is illustrated in Fig. 2. The selector is a BiLSTM network, trained to predict an importance score for every frame of the input video. The evaluator consists of two pairs of generators and discriminators. In particular, the forward generator $G_f$ and discriminator $D_f$ form the forward GAN; the backward generator $G_b$ and discriminator $D_b$ form the backward GAN. The two generators are implemented as variational autoencoder LSTMs, which encode the frame features into a latent variable and then decode it into the corresponding features. The two discriminators are LSTM networks that learn to distinguish generated frame features from true features. We extensively use the LSTM architecture here to comprehensively model the temporal information across video frames. Moreover, by adopting the joint structure of VAE and GAN, the video similarity can be measured more reliably by generating better high-level representations [Larsen et al.2015]. The cycle structure (forward GAN and backward GAN) converts from original to summary video and back again, in which information loss is minimized.
Given a video of $T$ frames, the first step is to extract its deep features via a deep CNN model. Given these features $v=\{v_t\}_{t=1}^{T}$, the selector predicts a sequence of importance scores $p=\{p_t\}_{t=1}^{T}$ indicating the importance level of the corresponding frames. During training, the frame features of the summary video are the importance-weighted features, $s_t=p_t\cdot v_t$. For testing, CycleSUM outputs discretized importance scores $p_t\in\{0,1\}$; the frames with importance score 1 are then selected.
With $v$ and $s$, the evaluator performs cycle-consistent learning (see Fig. 2) to evaluate the quality of the summary video w.r.t. both completeness and compactness. Specifically, the forward generator $G_f$ takes the current summary video as input and outputs a sequence of reconstructed features for the original video, namely $\hat{v}=G_f(s)$. The paired discriminator $D_f$ then estimates the distribution divergence between the original video and the summary video in the learned feature space. The backward generator and discriminator have a symmetrical network architecture and training procedure to the forward ones. In particular, the generator $G_b$ takes the original video features $v$ as input and outputs $\hat{s}=G_b(v)$ to reconstruct the summary video. The discriminator $D_b$ then tries to distinguish between $\hat{s}$ and $s$. The forward cycle-consistency $s\to G_f(s)\to G_b(G_f(s))\approx s$ and the backward cycle-consistency $v\to G_b(v)\to G_f(G_b(v))\approx v$ are imposed to enhance the information consistency between $v$ and $s$. This cycle-consistent processing guarantees that the original video can be reconstructed from the summary video and vice versa, meaning the summary video can tell the same story as the original.
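The selector's two operating modes described above can be sketched as follows. Function and variable names are illustrative; we assume, per the description above, that training-time summary features are importance-weighted frame features while test-time selection keeps the top-scoring fraction of frames:

```python
import numpy as np

# Sketch of the selector's training and testing behavior (illustrative names).
def summarize(features, scores, train=True, ratio=0.3):
    """features: (T, d) frame features; scores: (T,) importance in [0, 1]."""
    if train:
        # Training: weight each frame feature by its importance score,
        # keeping the selection step differentiable.
        return features * scores[:, None]
    # Testing: discretize by keeping the highest-scoring `ratio` of frames.
    k = max(1, int(round(ratio * len(scores))))
    keep = np.sort(np.argsort(scores)[-k:])   # indices of top-k frames, in order
    return features[keep]

feats = np.arange(20, dtype=float).reshape(10, 2)   # 10 frames, 2-d features
scores = np.array([0.9, 0.1, 0.2, 0.8, 0.1, 0.1, 0.7, 0.2, 0.1, 0.3])
summary = summarize(feats, scores, train=False)
print(summary.shape)    # 3 of 10 frames kept at ratio 0.3
```

The soft weighting during training keeps the whole pipeline differentiable, while hard selection at test time yields the actual key-frame summary.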
Training Loss
We design the following loss functions to train our CycleSUM model. The sparsity loss $L_{sparsity}$ is used to control the summary length for the selector; the prior loss $L_{prior}$ and the reconstruction loss $L_{recon}$ are used to train the two VAE-based generators; the adversarial losses $L_{adv}^{f}$ and $L_{adv}^{b}$ are derived from the forward GAN and backward GAN; and $L_{cyc}$ is the cycle-consistent loss.
Sparsity loss
This loss penalizes the proportion of selected frames forming the summary video relative to the original video; a higher sparsity requirement gives a shorter summary video. Formally, it is defined as
$$L_{sparsity}=\Big\|\frac{1}{T}\sum_{t=1}^{T}p_t-\delta\Big\|_2,$$
where $T$ is the total number of video frames and $\delta$ is a predefined percentage of frames to select in video summarization. The ground-truth value of $\delta$ in standard benchmarks is 15% [Gygli et al.2014, Song et al.2015], but we empirically set $\delta$ to 30% for the selector in training [Mahasseni, Lam, and Todorovic2017].
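A minimal sketch of this loss, using a squared deviation from the target ratio as a simple stand-in for the norm form (names are illustrative):

```python
import numpy as np

# Sparsity loss sketch: penalize deviation of the mean importance score from
# the target selection ratio delta (set to 0.3 during training, per the text).
def sparsity_loss(scores, delta=0.3):
    return float((np.mean(scores) - delta) ** 2)

scores = np.array([0.9, 0.1, 0.2, 0.8, 0.1, 0.1, 0.7, 0.2, 0.1, 0.3])
loss = sparsity_loss(scores)   # mean score 0.35 vs. target 0.3
print(round(loss, 6))
```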
Generative loss
We adopt a VAE as the generator for reconstruction, so the generative loss contains the prior loss $L_{prior}$ and the reconstruction loss $L_{recon}$. For the forward VAE (forward generator $G_f$), the encoder encodes the input features into the latent variable $z$. Assume $p(z)$ is the prior distribution of the latent variable; following the typical reparameterization trick, $p(z)$ is set to the standard Gaussian distribution $\mathcal{N}(0,I)$ [Kingma and Welling2013]. Define $q_{\phi}(z|s)$ as the posterior distribution and $p_{\theta}(\hat{v}|z)$ as the conditional generative distribution, where $\phi$ is the parameter of the encoder and $\theta$ is that of the decoder. The objective of the forward generator is
$$L_{G_f}=L_{prior}+L_{recon},$$
where the first term is the KL-divergence prior loss: $L_{prior}=\mathrm{KL}\big(q_{\phi}(z|s)\,\|\,p(z)\big)$.
The second term is an element-wise metric measuring the similarity between samples, which we use as the reconstruction loss $L_{recon}$. The typical reconstruction loss for autoencoder networks is the Euclidean distance between the input and the reconstructed output: $\|x-\hat{x}\|^2$. According to [Larsen et al.2015], element-wise metrics cannot model the properties of human visual perception, so they propose to jointly train the VAE (the generator) and the GAN discriminator, where the discriminator's hidden representation is used to measure sample similarity. Our proposed CycleSUM adopts the same structure to measure the video distance and achieves a feature-wise reconstruction. Specifically, if $x$ and $\hat{x}$ are the input and output of the VAE (the generator), the outputs of the last hidden layer of the discriminator are $\phi(x)$ and $\phi(\hat{x})$. Then, consider $p\big(\phi(x)\,|\,z\big)=\mathcal{N}\big(\phi(x)\,|\,\phi(\hat{x}),\,I\big)$. The expectation can be computed by an empirical average, and the reconstruction loss can be rewritten as
$$L_{recon}=-\mathbb{E}_{q_{\phi}(z|x)}\big[\log p\big(\phi(x)\,|\,z\big)\big].$$
The backward generator has the same $L_{prior}$ and $L_{recon}$ as the forward generator; the only difference is that its input and output are reversed.
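The two generator losses can be sketched as follows: the closed-form KL prior loss of a diagonal-Gaussian posterior against $\mathcal{N}(0,I)$, and a feature-space reconstruction error computed on stand-in discriminator hidden features, used here as a simple squared-error surrogate for the Gaussian log-likelihood formulation above (shapes and names are illustrative):

```python
import numpy as np

def prior_loss(mu, log_var):
    # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over dimensions
    return float(0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var))

def feature_recon_loss(phi_x, phi_x_hat):
    # squared error between discriminator hidden features of input and output
    return float(np.mean((phi_x - phi_x_hat) ** 2))

mu, log_var = np.zeros(4), np.zeros(4)
print(prior_loss(mu, log_var))                       # 0.0: posterior == prior
print(feature_recon_loss(np.ones(8), np.ones(8)))    # 0.0: perfect match
```

Both terms vanish exactly when the posterior matches the prior and the reconstruction matches the input in the discriminator's feature space.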
Adversarial loss
The learning objective of the evaluator is to maximize the mutual information between the original and summary video. According to Eqn. (4), the adversarial losses for the forward GAN ($G_f$ and $D_f$) and the backward GAN ($G_b$ and $D_b$) are
$$L_{adv}^{f}=\mathbb{E}_{S}\big[D_f(G_f(S))\big]-\mathbb{E}_{V}\big[\exp\big(D_f(V)-1\big)\big],\qquad L_{adv}^{b}=\mathbb{E}_{V}\big[D_b(G_b(V))\big]-\mathbb{E}_{S}\big[\exp\big(D_b(S)-1\big)\big].$$
To avoid mode collapse and improve the stability of optimization, we instead use the loss suggested in Wasserstein GAN [Arjovsky, Chintala, and Bottou2017]:
$$L_{adv}^{f}=\mathbb{E}_{S}\big[D_f(G_f(S))\big]-\mathbb{E}_{V}\big[D_f(V)\big],\qquad L_{adv}^{b}=\mathbb{E}_{V}\big[D_b(G_b(V))\big]-\mathbb{E}_{S}\big[D_b(S)\big].$$
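The WGAN-style losses can be sketched with plain expectations over critic outputs; the critic values below are illustrative numbers, not outputs of a trained model:

```python
import numpy as np

# WGAN-style critic and generator losses (a sketch with toy critic outputs).
def critic_loss(d_real, d_fake):
    # The critic maximizes E[D(real)] - E[D(fake)]; we return the negative
    # so that minimizing this value performs that maximization.
    return float(-(np.mean(d_real) - np.mean(d_fake)))

def generator_loss(d_fake):
    # The generator maximizes E[D(fake)], i.e. minimizes -E[D(fake)].
    return float(-np.mean(d_fake))

d_real = np.array([0.8, 0.6, 0.7])
d_fake = np.array([0.1, 0.2, 0.0])
print(round(critic_loss(d_real, d_fake), 4), round(generator_loss(d_fake), 4))
```

Unlike the exponential f-divergence form, these losses are linear in the critic output, which is what stabilizes optimization in practice.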
CycleConsistent Loss
Since we expect the summary frame features to contain all the information of the original frame features, the original video should be fully reconstructable from them. Thus, when we convert from the original to the summary video and then back again, we should obtain a video similar to the original one. In this way we can safely guarantee the completeness of the summary video. This processing is more advantageous than the one-direction reconstruction in existing image reconstruction works [Zhu et al.2017, Yi et al.2017]. Based on this intuition, we introduce the cycle-consistent loss below. The procedure for the forward cycle is $s\to G_f(s)\to G_b(G_f(s))\approx s$, and the forward cycle-consistent loss is correspondingly defined as $L_{cyc}^{f}=\mathbb{E}\big[\|G_b(G_f(s))-s\|_1\big]$. For the backward cycle, the procedure is $v\to G_b(v)\to G_f(G_b(v))\approx v$, so the backward cycle-consistent loss is $L_{cyc}^{b}=\mathbb{E}\big[\|G_f(G_b(v))-v\|_1\big]$. We adopt the $\ell_1$ distance for $L_{cyc}$, since the $\ell_2$ distance often leads to blurriness [Larsen et al.2015, He et al.2016].
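The cycle-consistent loss can be sketched directly from its definition; $G_f$ and $G_b$ below are stand-in maps chosen only to illustrate that an exact inverse pair drives the loss to zero:

```python
import numpy as np

# Cycle-consistent loss sketch: l1 distance between frame features and their
# cycle reconstruction.
def cycle_l1(x, x_cycled):
    return float(np.mean(np.abs(x - x_cycled)))

G_f = lambda s: s * 2.0          # toy summary -> original map
G_b = lambda v: v / 2.0          # toy original -> summary map

s = np.array([1.0, 2.0, 3.0])
loss_fwd = cycle_l1(s, G_b(G_f(s)))     # forward cycle s -> G_f -> G_b
print(loss_fwd)
```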
Overall Loss
The overall objective for training the CycleSUM model is
$$L=L_{sparsity}+\lambda_1\big(L_{adv}^{f}+L_{adv}^{b}\big)+\lambda_2\big(L_{prior}+L_{recon}\big)+\lambda_3\big(L_{cyc}^{f}+L_{cyc}^{b}\big),$$
where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are hyper-parameters balancing the adversarial, generative and cycle-consistent terms.
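The combination of losses can be sketched as a weighted sum; the unit default weights below are assumptions for illustration, not reported hyper-parameter values:

```python
# Overall objective sketch: weighted sum of the component losses.
def overall_loss(l_sparsity, l_adv, l_gen, l_cyc,
                 lam1=1.0, lam2=1.0, lam3=1.0):
    return l_sparsity + lam1 * l_adv + lam2 * l_gen + lam3 * l_cyc

total = overall_loss(0.1, 0.2, 0.3, 0.4)
print(round(total, 6))
```

Setting one of the weights to zero recovers the corresponding ablation variants analyzed later.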
Training CycleSUM
Given the above training losses and the final objective function, we adopt Stochastic Gradient Variational Bayes estimation [Kingma and Welling2014] to update the parameters during training. The selector and the generators in CycleSUM are jointly trained to maximally confuse the discriminators. To stabilize the training process, we initialize all parameters with Xavier initialization [Glorot and Bengio2010] and clip all discriminator parameters to a small fixed range [Arjovsky, Chintala, and Bottou2017]. The generator is updated multiple times per discriminator iteration. Algorithm 1 summarizes all steps for training CycleSUM.
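The weight-clipping step from WGAN training can be sketched as follows; the clipping range 0.01 is the WGAN default and is assumed here for illustration:

```python
import numpy as np

# After each discriminator update, clip every parameter into [-c, c].
def clip_parameters(params, c=0.01):
    return [np.clip(p, -c, c) for p in params]

params = [np.array([0.5, -0.005, -0.3]), np.array([0.02])]
clipped = clip_parameters(params)
print(clipped[0], clipped[1])
```

Clipping enforces a crude Lipschitz constraint on the critic, which the Wasserstein loss requires.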
Experiment
Experiment Setup
Datasets and Protocol
We evaluate CycleSUM on two benchmark datasets: SumMe [Gygli et al.2014] and TVSum [Song et al.2015]. SumMe contains 25 videos ranging from 1 to 7 minutes, with frame-level binary importance scores. TVSum contains 50 videos downloaded from YouTube, with shot-level importance scores taking integer values from 1 to 5.
Following the convention [Gygli, Grabner, and Van Gool2015, Zhang et al.2016, Mahasseni, Lam, and Todorovic2017], we adopt the F-measure as the performance metric. Given the ground-truth and the produced summary video, we calculate the harmonic-mean F-score from precision and recall for evaluation.
For the TVSum dataset, shot-level ground truths are provided, while the outputs of CycleSUM at test time are frame-level scores. We thus follow the method in [Zhang et al.2016] to convert the frame-level evaluation to a shot-level evaluation.
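The evaluation metric above can be sketched on binary frame-selection vectors (a simplified frame-level version; names are illustrative):

```python
import numpy as np

# Harmonic-mean F-score between ground-truth and predicted binary selections.
def f_measure(gt, pred):
    gt, pred = np.asarray(gt, float), np.asarray(pred, float)
    overlap = float(np.sum(gt * pred))      # frames selected in both
    if overlap == 0.0:
        return 0.0
    precision = overlap / float(np.sum(pred))
    recall = overlap / float(np.sum(gt))
    return 2.0 * precision * recall / (precision + recall)

gt   = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 1, 0, 0, 1, 1, 0]
print(round(f_measure(gt, pred), 4))
```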
Implementation Details
For fairness, the frame features used to train our model are the same as in [Zhang et al.2016, Mahasseni, Lam, and Todorovic2017]. We extract 1024-d frame features from the output of the pool5 layer of the GoogLeNet network [Szegedy et al.2015], which is pretrained on ImageNet [Deng et al.2009].
Each of the two generators in our CycleSUM is a VAE-based LSTM network consisting of an encoder and a decoder, each a two-layer LSTM. A decoder LSTM that reconstructs the sequence in reverse order is easier to train [Srivastava, Mansimov, and Salakhudinov2015], so the decoders in CycleSUM also reconstruct the frame features in reverse order. The discriminators are LSTM networks followed by a fully-connected layer that produces a probability (true or false) for the input. Following the architecture of WGAN [Arjovsky, Chintala, and Bottou2017], we remove the Sigmoid function in the last layer of each discriminator to make the model easier to train. The selector is a three-layer BiLSTM network.
The two VAE generators are initialized by pretraining on frame features of the original videos. Similar to [Mahasseni, Lam, and Todorovic2017], such an initialization strategy can accelerate training and improve overall accuracy.
All experiments are conducted five times on five random splits, and we report the average performance.
Quantitative Results
Table 1: F-score (%) of unsupervised methods on SumMe and TVSum.

Method                                        SumMe   TVSum
De Avila et al. [De Avila et al.2011]          33.7      –
Li et al. [Li and Merialdo2010]                26.6      –
Khosla et al. [Khosla et al.2013]                –       50
Song et al. [Song et al.2015]                  26.6     50
SUMGAN [Mahasseni, Lam, and Todorovic2017]     39.1    51.7
CycleSUM                                       41.9    57.6
We compare our CycleSUM model with several unsupervised state-of-the-art methods in Tab. 1. CycleSUM outperforms all of them, by a margin of up to 2.8% on SumMe. In particular, CycleSUM outperforms SUMGAN [Mahasseni, Lam, and Todorovic2017] on both datasets, clearly demonstrating the effectiveness of our proposed cycle-consistent loss. On TVSum, the improvement is over 5.9%. These results demonstrate the superior performance of CycleSUM for video summarization.
Ablation Analysis of CycleSUM
Table 2: Ablation analysis of CycleSUM (F-score, %).

Variant        SumMe   TVSum
CycleSUMC       34.8    49.5
CycleSUM1G      38.2    51.4
CycleSUM2G      39.7    53.6
CycleSUMGf      40.3    55.2
CycleSUMGb      39.9    55.0
CycleSUM        41.9    57.6
We further conduct an ablation analysis to study the effects of different components of the CycleSUM model, including the generative adversarial learning and the cycle-consistent learning. In particular, we consider the following ablation variants of CycleSUM.

CycleSUMC. This variant is proposed to verify the effect of adversarial learning. It drops the adversarial losses $L_{adv}^{f}$ and $L_{adv}^{b}$ and keeps all other losses, in particular the cycle-consistent loss, in the overall loss.

CycleSUM2G. The cycle-consistent losses $L_{cyc}^{f}$ and $L_{cyc}^{b}$ are removed from the overall loss, while the forward GAN and backward GAN are kept. We compare the results of CycleSUM2G with CycleSUM to analyze the function of cycle-consistent reconstruction.

CycleSUM1G. The cycle-consistent losses $L_{cyc}^{f}$ and $L_{cyc}^{b}$ are not used when training this variant. Meanwhile, we remove the backward generator $G_b$ and discriminator $D_b$, so there is no backward reconstruction $v\to\hat{s}$. This model is similar to SUMGAN [Mahasseni, Lam, and Todorovic2017]: it only has the forward GAN, whereas CycleSUM2G contains both GANs during training. We are thus also interested in comparing CycleSUM2G and CycleSUM1G.

CycleSUMGf. The backward cycle-consistent loss $L_{cyc}^{b}$ is not included in the overall objective during training, while the forward cycle-consistent loss $L_{cyc}^{f}$ is kept. The forward and backward adversarial learning are both kept.

CycleSUMGb. The forward cycle-consistent loss $L_{cyc}^{f}$ is not included in the overall objective during training, while the backward one is kept. The forward and backward adversarial learning are both kept.
Comparing the F-scores of CycleSUM and CycleSUMC in Tab. 2, we can see that adversarial learning improves performance significantly, proving the positive effect of deploying GANs in unsupervised video summarization. Compared with the one-GAN variant CycleSUM1G, CycleSUM2G gains 2% on average. Both comparisons verify that adversarial learning helps improve video summarization.
Comparing CycleSUM2G and CycleSUM, the F-scores rise by 2.2% on SumMe and 4.0% on TVSum. This shows that cycle-consistent reconstruction helps ensure the summary video contains the full information of the original video.
The results of CycleSUMGf are slightly better than those of CycleSUMGb. However, both variants bring performance gains over CycleSUM2G, which also shows that the forward and backward cycle-consistent processing promote the ability to select a complete summary video from the original.
To sum up, the adversarial learning ensures that the summary and original videos keep a suitable distance in the deep feature space, and the cycle-consistent learning ensures that the selected frames retain the full information of the original video.
Qualitative Results
Fig. 3 shows summarization examples from a sample video in TVSum. We compare the frames selected by CycleSUM with those of two recent state-of-the-art models, vsLSTM [Zhang et al.2016] and SUMGAN [Mahasseni, Lam, and Todorovic2017], using a successful example for all three models. As shown in Fig. 3, CycleSUM selects shorter but more key shots than the other two models. Compared with the results of vsLSTM and SUMGAN, some topic-specific and informative details, e.g. frames showing the doctor pushing medicinal liquid into the dog's ear, are correctly selected by CycleSUM.
Conclusion
In this paper, we theoretically reveal how to effectively maximize mutual information by cycle-consistent adversarial learning. Based on this theoretical analysis, we propose the new CycleSUM model for frame-level video summarization. Experimental results show that the cycle-consistent mechanism can significantly improve video summarization, and that CycleSUM produces more precise summary videos than strong baselines, which validates the effectiveness of our method.
Acknowledgments
This work was supported in part to Jiashi Feng by NUS IDS R263000C67646, ECRA R263000C87133 and MOE Tier-II R263000D17112, in part to Ping Li by NSFC under Grants 61872122 and 61502131, and in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LY18F020015.
References
 [Arjovsky, Chintala, and Bottou2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.
 [Cong, Yuan, and Luo2012] Cong, Y.; Yuan, J.; and Luo, J. 2012. Towards scalable summarization of consumer videos via sparse dictionary selection. IEEE Transactions on Multimedia 14(1):66–75.
 [De Avila et al.2011] De Avila, S. E. F.; Lopes, A. P. B.; da Luz Jr, A.; and de Albuquerque Araújo, A. 2011. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters 32(1):56–68.
 [Deng et al.2009] Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. ImageNet: A large-scale hierarchical image database. In CVPR, 248–255. IEEE.
 [Ejaz, Mehmood, and Baik2013] Ejaz, N.; Mehmood, I.; and Baik, S. W. 2013. Efficient visual attention based framework for extracting key frames from videos. Signal Processing: Image Communication 28(1):34–44.
 [Glorot and Bengio2010] Glorot, X., and Bengio, Y. 2010. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 249–256.
 [Gong et al.2014] Gong, B.; Chao, W.L.; Grauman, K.; and Sha, F. 2014. Diverse sequential subset selection for supervised video summarization. In Advances in Neural Information Processing Systems, 2069–2077.
 [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
 [Gygli et al.2014] Gygli, M.; Grabner, H.; Riemenschneider, H.; and Van Gool, L. 2014. Creating summaries from user videos. In ECCV, 505–520. Springer.
 [Gygli, Grabner, and Van Gool2015] Gygli, M.; Grabner, H.; and Van Gool, L. 2015. Video summarization by learning submodular mixtures of objectives. In CVPR, 3090–3098.
 [He et al.2016] He, D.; Xia, Y.; Qin, T.; Wang, L.; Yu, N.; Liu, T.; and Ma, W.Y. 2016. Dual learning for machine translation. In Advances in Neural Information Processing Systems, 820–828.
 [Khosla et al.2013] Khosla, A.; Hamid, R.; Lin, C.J.; and Sundaresan, N. 2013. Largescale video summarization using webimage priors. In CVPR, 2698–2705.
 [Kim, Sigal, and Xing2014] Kim, G.; Sigal, L.; and Xing, E. P. 2014. Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In CVPR. IEEE.
 [Kingma and Welling2013] Kingma, D. P., and Welling, M. 2013. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114.
 [Kingma and Welling2014] Kingma, D. P., and Welling, M. 2014. Stochastic gradient vb and the variational autoencoder. In ICLR.
 [Kuanar, Panda, and Chowdhury2013] Kuanar, S. K.; Panda, R.; and Chowdhury, A. S. 2013. Video key frame extraction through dynamic delaunay clustering with a structural constraint. Journal of Visual Communication and Image Representation 24(7):1212–1227.
 [Larsen et al.2015] Larsen, A. B. L.; Sønderby, S. K.; Larochelle, H.; and Winther, O. 2015. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300.
 [Li and Merialdo2010] Li, Y., and Merialdo, B. 2010. Multi-video summarization based on Video-MMR. In International Workshop on Image Analysis for Multimedia Interactive Services, 1–4. IEEE.
 [Lu and Grauman2013] Lu, Z., and Grauman, K. 2013. Story-driven summarization for egocentric video. In CVPR, 2714–2721. IEEE.

 [Ma et al.2002] Ma, Y.-F.; Lu, L.; Zhang, H.-J.; and Li, M. 2002. A user attention model for video summarization. In ACM Multimedia, 533–542. ACM.
 [Mahasseni, Lam, and Todorovic2017] Mahasseni, B.; Lam, M.; and Todorovic, S. 2017. Unsupervised video summarization with adversarial LSTM networks. In CVPR.
 [Mei et al.2015] Mei, S.; Guan, G.; Wang, Z.; Wan, S.; He, M.; and Feng, D. D. 2015. Video summarization via minimum sparse reconstruction. Pattern Recognition 48(2):522–533.
 [Meng et al.2016] Meng, J.; Wang, H.; Yuan, J.; and Tan, Y.P. 2016. From keyframes to key objects: Video summarization by representative object proposal selection. In CVPR, 1039–1048.
 [Potapov et al.2014] Potapov, D.; Douze, M.; Harchaoui, Z.; and Schmid, C. 2014. Category-specific video summarization. In ECCV, 540–555. Springer.
 [Pritch et al.2007] Pritch, Y.; Rav-Acha, A.; Gutman, A.; and Peleg, S. 2007. Webcam synopsis: Peeking around the world. In ICCV, 1–8. IEEE.
 [Song et al.2015] Song, Y.; Vallmitjana, J.; Stent, A.; and Jaimes, A. 2015. TVSum: Summarizing web videos using titles. In CVPR, 5179–5187.
 [Srivastava, Mansimov, and Salakhudinov2015] Srivastava, N.; Mansimov, E.; and Salakhudinov, R. 2015. Unsupervised learning of video representations using LSTMs. In ICML, 843–852.
 [Szegedy et al.2015] Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A.; et al. 2015. Going deeper with convolutions. In CVPR.
 [Wang et al.2012] Wang, M.; Hong, R.; Li, G.; Zha, Z.J.; Yan, S.; and Chua, T.S. 2012. Event driven web video summarization by tag localization and keyshot identification. IEEE Transactions on Multimedia 14(4):975–985.

 [Yi et al.2017] Yi, Z.; Zhang, H.; Tan, P.; and Gong, M. 2017. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint.
 [Zhang et al.2016] Zhang, K.; Chao, W.-L.; Sha, F.; and Grauman, K. 2016. Video summarization with long short-term memory. In ECCV, 766–782. Springer.
 [Zhu et al.2017] Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593.