DTR-GAN: Dilated Temporal Relational Adversarial Network for Video Summarization

The large number of videos produced every day makes it increasingly critical that key information within videos can be extracted and understood in a very short time. Video summarization, the task of finding the smallest subset of frames that still conveys the whole story of a given video, is thus of great significance for improving the efficiency of video understanding. In this paper, we propose a novel Dilated Temporal Relational Generative Adversarial Network (DTR-GAN) to achieve frame-level video summarization. Given a video, it selects a set of key frames that contains the most meaningful and compact information. Specifically, DTR-GAN learns a dilated temporal relational generator and a discriminator with a three-player loss in an adversarial manner. A new dilated temporal relation (DTR) unit is introduced to enhance the capture of temporal representations. The generator aims to select key frames by using DTR units to effectively exploit global multi-scale temporal context and to complement the commonly used Bi-LSTM. To ensure that the summaries capture enough key video representation from a global perspective rather than being a trivial, randomly shortened sequence, we present a discriminator that learns to enforce both the information completeness and compactness of summaries via a three-player loss. The three-player loss includes the generated summary loss, the random summary loss, and the real summary (ground-truth) loss, which play important roles in regularizing the learned model to obtain useful summaries. Comprehensive experiments on two public datasets, SumMe and TVSum, show the superiority of our DTR-GAN over the state-of-the-art approaches.


I Introduction

Driven by the large number of videos produced every day, video summarization [1, 2, 3] plays an important role in extracting and analyzing key contents within videos. Video summarization techniques have recently gained increasing attention as a way to facilitate large-scale video distilling [4, 5, 6, 7]. They aim to generate summaries by selecting a small set of key frames/shots in the video while still conveying the whole story, and can thus improve the efficiency of key information extraction and understanding.

Fig. 1: The proposed DTR-GAN aims to extract key frames which depict the original video in a complete and compact way. The DTR units are introduced to complement the commonly used Bi-LSTM, in order to better capture long-range temporal dependencies. The adversarial network with the supervised loss for the generator and the three-player discriminator loss, acts as a form of regularization to obtain better summarization results.

Fig. 2: The network architecture of our DTR-GAN. Taking a video sequence as input, we obtain appearance features of all frames by passing the original frames through the pretrained ResNet-152 model. The generator used to predict key frames consists of three components: 1) a temporal encoding module that integrates both bidirectional LSTM units and stacked DTR units is employed to produce enhanced features of each frame; 2) the confidence scores of all frames are then predicted by passing the features into the summary predictor module; 3) as another branch, the enhanced features of all frames are combined into the encoded features by the compact video representation module. The discriminator then uses the encoded features and the summary scores to generate representations of three summaries, i.e., the ground-truth summary, the predicted summary and the randomly selected summary. These three summary representations are then concatenated with the encoded features of the original video and are further fed into a shared Bi-LSTM module to get a real loss and two fake losses, in order to judge their fidelity.

Essentially, video summarization techniques need to address two key challenges in order to provide effective summarization results: 1) how to exploit a good key-frame/key-shot selection policy that can take into account the long-range temporal correlations embedded in the whole video to determine the uniqueness and importance of each frame/shot; 2) from a global perspective, how to ensure that the resulting short summary can capture all key contents of the video with a minimal number of frames/shots, that is, how to ensure video information completeness and summary compactness.

Previous works have made some attempts toward solving these challenges. For instance, video summarization methods have to a large extent made use of LSTMs [5, 6] and determinantal point processes (DPP) [8, 9, 2] in order to address the first challenge and learn temporal dependencies. However, because the memory in LSTMs and DPPs is limited, we believe that there is still room to better exploit long-range temporal relations in videos.

The second challenge is often addressed by utilizing feature-based approaches, e.g., learning motion features [1, 10, 11], to encourage diversity between the frames included in the summary. However, this cannot ensure the information completeness and compactness of summaries, leading to redundant frames and less informative results. A recent work [6] utilizing adversarial neural networks reduces redundancy by minimizing the distance between training videos and the distribution of summaries, but it encodes all the different information into one fixed-length representation, which reduces the learning capability of the model for video sequences of different lengths.

To better address the two core challenges in the video summarization task, namely modeling of long-range temporal dependencies and information completeness and compactness, we propose a novel dilated temporal relational generative adversarial network (DTR-GAN). Figure 1 shows an overview of the proposed method. The generator, which consists of Dilated Temporal Relational (DTR) units and a Bi-LSTM, takes the real summary and the video representation as the input. DTR units aim to exploit long-range temporal dependencies complementing the commonly used LSTMs. The discriminator takes three pairs of input: generated summary pair, real summary pair and random summary pair and optimizes a three-player loss during training. To better ensure the completeness and compactness, we further introduce a supervised generator loss during adversarial training as a form of regularization.

DTR units integrate context among frames at multi-scale time spans, in order to enlarge the model's temporal field-of-view and thereby effectively model temporal relations among frames. We use three layers of DTR units, each modeling four different time spans, to capture short-term, mid-term and long-term dependencies. Combining DTR units with the LSTMs gives the generator a stronger generating ability. The discriminator is trained to discriminate real summaries from the generated summary, which further enhances the ability of the generator. At the same time, to ensure that the video representations are not learned from a trivial, randomly shortened sequence, we propose to reformulate the traditional learning objective of the adversarial network as a three-player loss. This ensures more robust and accurate results by forcing the discriminator to effectively regularize the model.

Our approach essentially achieves better model capability with DTR units by exploiting the global multi-scale temporal context. Further, the three-player loss-based adversarial network also provides more effective regularization to improve the discriminator's ability to distinguish real summaries from fake ones. This, in turn, leads to better generated summaries. Evaluation on two public benchmark datasets, SumMe [12] and TVSum [13], demonstrates the effectiveness of our proposed method compared to state-of-the-art approaches.

In summary, this paper makes the following contributions:

  • DTR-GAN. We propose a novel dilated temporal relational generative adversarial network for video summarization, which excels at capturing long-range global dependencies of temporal context in videos and generating a compact subset of frames with good information completeness. The experiments on two public datasets, SumMe [12] and TVSum [13], demonstrate the superiority of our framework.

  • DTR units. We develop Dilated Temporal Relational (DTR) units to depict global multi-scale temporal context and complement the commonly used Bi-LSTM. DTR units dynamically capture different levels of temporal relations with respect to different hole sizes, which can enlarge the model’s field-of-view and thus better capture the long-range temporal dependencies.

  • Adversarial network with three-player loss. We design a new adversarial network with a three-player loss, which adds regularization to improve the model's abilities during adversarial training. Different from the traditional two-player loss, we introduce a generated summary loss, a random summary loss and a real summary (ground-truth) loss, to better learn summaries and to avoid selecting trivial, random short sequences as the results.

II Related Work

II-A Video Summarization.

Recent video summarization works apply both deep learning frameworks and other traditional techniques to achieve key frame/shot-level summarization, leading to a significant improvement on this task. For example, Gygli et al. [11] formulate it as a subset selection problem and use submodular maximization to learn a linear combination of adapted submodular functions. In [9], egocentric video summarization is achieved by using gaze tracking information (such as fixation and saccade). They also use submodular function maximization to ensure relevant and diverse summaries. Zhao and Xing [14] propose onLIne VidEo highLIGHTing (LiveLight), which can generate a short video clip in an online manner via dictionary learning, and can thus start processing arbitrarily long videos without seeing the entire video. Besides, Zhang et al. [15] also adopt dictionary learning, using sparse coding with a generalized sparse group lasso to ensure that the most informative features and relationships are retained. They focus on individual local motion regions and the interactions between them.

More recently, works using deep learning frameworks have been proposed and have achieved great progress. Zhou et al. [16] use a deep summarization network via reinforcement learning to achieve both supervised and unsupervised video summarization. They design a novel reward function that jointly takes diversity and representativeness of generated summaries into account. Ji et al. formulate video summarization as a sequence-to-sequence learning problem and introduce an attentive encoder-decoder network (AVS) to obtain key video shots. They use Long Short-Term Memory (LSTM) networks [18] for both the encoder and the decoder to explore contextual information. Zhang et al. [5] also use LSTM networks. They propose a supervised learning technique that uses LSTMs to automatically select both keyframes and key subshots, complemented with Determinantal Point Processes (DPP) [19] for modeling inter-frame repulsiveness to encourage diversity of generated summaries. There are some other works on DPP. Gong et al. [8] propose the sequential determinantal point process (seqDPP), which heeds the inherent sequential structures in video data and retains the power of modeling diverse subsets, so that good summaries possessing multiple properties can be created. In [20], keyframe-based video summarization is performed by nonparametrically transferring structures from human-created summaries to unseen videos. They use DPP for extracting globally optimal subsets of frames to generate summaries. In [21], a pairwise deep ranking model is employed to learn the relationship between highlight and non-highlight video segments, to discover highlights in videos. They design the model with spatial and temporal streams, followed by the combination of the two components as the final highlight score for each segment.

Moreover, in [3], videos are summarized into key objects by selecting the most representative object proposals generated from the videos. A fine-grained video summarization is thus achieved, which can tell what objects appear in each video. Later, Atsushi et al. [22] build a summary depending on the users' viewpoints, as a way of inferring what the desired viewpoint may be from multiple groups of videos. They take video-level semantic similarity into consideration to estimate the underlying users' viewpoints and thus generate summaries by jointly optimizing inner-summary, inner-group and between-group variances defined on the feature representation.

More recently, the video summarization task has also been performed by using vision-language joint embeddings. For example, Chu et al. [23] exploit video visual co-occurrence across multiple videos by using a topic keyword for each video. They develop a Maximal Biclique Finding (MBF) algorithm to find shots that co-occur most frequently across videos. Plummer et al. [7] train image features paired with text annotations from both same and different domains, by projecting video features into a learned joint vision-language embedding space, to capture the story elements and enable users to guide summaries with free-form text input. Panda et al. [24] summarize collections of topic-related videos with topic keywords. They introduce a collaborative sparse optimization method with a half-quadratic minimization algorithm, which captures both important particularities arising in a given video and generalities arising across the whole video collection.

II-B Generative Adversarial Networks (GANs).

Generative Adversarial Networks (GANs) [25] consist of two components, a generator network and a discriminator network, which are trained adversarially. The generator works on fitting the true data distribution while confusing the discriminator, whose task is to discriminate true data from fake data. They have successfully been used in many fields.

Recently GANs have been used widely for many vision problems such as image-to-image translation [26], image generation [27, 28], representation learning [29, 30] and image understanding [31, 32]. For example, Zhu et al. [26] use cycle-consistent adversarial networks to translate images from a source domain to a target domain in the absence of paired examples. In [27], a text-conditional convolutional GAN is developed for generating images based on detailed visual descriptions, which can effectively bridge the characters and visual pixels.

To the best of our knowledge, the only existing GAN-based video summarization approach is [6]. In [6], video summarization is formulated as selecting a sparse subset of video frames in an unsupervised way. In their work, they develop a deep summarizer network for learning to minimize the distance between training videos and the distribution of their summarizations. The model consists of an autoencoder LSTM as the summarizer and another LSTM as the discriminator. The summarizer LSTM is thus trained to confuse the discriminator, which forces the summarizer to obtain better summaries. We also explore a frame-level video summarization method using a GAN-based architecture. Different from the above work [6], we design a three-player loss that takes the random summary, generated summary and ground-truth summary into account, to provide better regularization. Moreover, in our generator network, we also introduce DTR units, which can enhance the temporal context representation.

Fig. 3: An illustration of the proposed Dilated Temporal Relational (DTR) unit. Given a video sequence in which each frame has an appearance feature, our DTR units dynamically capture different levels of temporal relations by varying the hole sizes, integrating temporal context from multi-range neighboring frames. As shown in (a), a DTR layer contains four DTR units with different hole sizes, each of which is a concatenation followed by a temporal convolution. Each hole size yields a different temporal relation range. After that, a summation operation is used to merge the learned outputs together, as the output of that layer of the DTR network, which has three DTR layers in total. The whole DTR network architecture is shown in (b). It takes the appearance features of all frames as the input and uses three DTR layers, each followed by a batch normalization layer and a ReLU layer. A learned representation is obtained after each DTR layer, and together they yield the final learned feature. By combining the temporal information from each DTR unit with respect to different hole sizes, we can enhance the features of each frame by integrating multi-scale temporal contexts.

III Our Approach

The proposed DTR-GAN framework aims to resolve the key frame-level video summarization problem by jointly training a dilated temporal relational generator and a discriminator with a three-player loss in an adversarial manner. Figure 2 illustrates the overall learning process. In the generator, the DTR units help exploit global multi-scale temporal context, as a complement to LSTMs, to select the most representative frames. In the discriminator, three losses, including a generated summary loss, a random summary loss, and a real summary loss, are applied together to enhance the compactness and completeness of the summary. In the following sections, we first introduce the new dilated temporal relational (DTR) units for effectively capturing long-range and multi-scale temporal contexts to facilitate the summarization. We then present the details of our DTR-GAN network with a novel three-player loss.

III-A Dilated Temporal Relation Units

A desirable video summarization model should be capable of effectively exploiting the global temporal context embedded in future and past frames of the video in order to better determine the uniqueness and vital roles of each frame. We thus investigate how to achieve a good temporal context representation by introducing a new temporal relation layer.

Prior works often simply used various LSTM architectures to encode the temporal dynamic information in the video. However, models purely relying on the memory mechanism of LSTM units may fail to encode long-range temporal context, such as in video sequences exceeding 1000 time steps. Moreover, redundant frames often appear in a small neighborhood of each frame. Besides modeling the long-term temporal changes in the video, as can be done using LSTM units, it is, therefore, important to further model local and multi-scale temporal relations to obtain compact video summaries.

Inspired by the success achieved by atrous convolutions for long-range dense feature extraction [33], which employs atrous convolution in cascade or in parallel for multi-scale context capturing, and by temporal convolution networks [34], which use a hierarchy of temporal convolutions, the key idea of our DTR unit is to capture temporal relational dependencies among video frames at multiple time scales. This is done by employing dilated convolutions across the temporal dimension, as illustrated in Figure 3.

Given a video sequence, we extract an appearance feature for each of its frames. The features are extracted using the ResNet-152 [35] model, which has been pretrained on ILSVRC 2015 [36].

Formally, DTR units operate on the above appearance features of the whole video by incorporating temporal relations among frames over different time spans, corresponding to different hole sizes. In our model, we use three DTR layers, each containing four different DTR units.

As shown in Figure 3(a), for each frame, a DTR layer enhances its feature using four DTR units, followed by a summation operation that merges all the information together and generates the learned feature of that layer. The enhanced feature of a frame is thus the sum, over the different hole sizes used in the DTR layer, of the outputs of the corresponding transformation functions. Each transformation function operates on the concatenation of the frame's feature with the features of its neighboring frames at the given hole size, and has distinct parameters for each hole size. The transformation is formulated as a temporal convolution along the temporal dimension only and results in a learned temporal representation of the same size as the input feature. For each DTR layer, we empirically use four hole sizes: 1, 4, 16, and 64. Each hole size determines the time span over which the temporal relations of each frame are captured.

In Figure 3(b), an illustration of the DTR network with three layers of DTR units is shown. It takes the appearance features of the video as the input. Each DTR layer is followed by batch normalization and ReLU operations and yields a learned representation. The final output of the DTR network combines the different temporal relations captured across the video sequence, so that the appearance features are converted into temporally sensitive features that explicitly encode multi-scale temporal dependencies. The receptive field of the filters is determined by the hole sizes used at each layer.
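To give a feel for how the receptive field grows with the hole sizes, the following Python sketch computes it for stacked dilated temporal convolutions. It assumes that every unit looks at three temporal positions (the frame itself and one neighbor on each side at the given hole size), which matches the three-way concatenation described above but is not confirmed as the exact filter size; the function name `receptive_field` is hypothetical.

```python
def receptive_field(hole_sizes_per_layer, taps=3):
    """Temporal receptive field, in frames, of stacked dilated 1-D convolutions.

    hole_sizes_per_layer: one list of hole sizes per layer (parallel units).
    taps: number of temporal positions each unit reads (assumed to be 3).
    """
    field = 1
    for hole_sizes in hole_sizes_per_layer:
        # Parallel units are summed, so the widest unit sets the layer's reach.
        field += (taps - 1) * max(hole_sizes)
    return field


# Three layers, each with the hole sizes 1, 4, 16 and 64 given in the text.
layers = [[1, 4, 16, 64]] * 3
print(receptive_field(layers))  # 385
```

Under these assumptions, three layers let each output frame depend on several hundred neighboring frames while the temporal resolution stays unchanged.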

The DTR network expands the filters' field-of-view without any reduction in temporal resolution to model long-range temporal dependencies, which is an advantage over other spatio-temporal feature extractors such as [37] and [38]. Those methods encode each video clip into a fixed descriptor and cannot produce embeddings at the frame level, which our task requires for generating frame-level scores afterwards.

Each DTR unit models temporal relationships by capturing neighboring features over a different time span. It can thus sense different neighboring features along the time axis and learn the dependencies among them. Besides, the DTR unit also has the advantage of low computational complexity because of its simplicity. The proposed DTR unit is general enough to be added to any network architecture to enhance temporal information encoding.
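The behavior of a single DTR layer can be sketched in plain Python as follows. This is only an illustration under stated assumptions: each unit concatenates the features at frames t - r, t and t + r for hole size r, sequence boundaries are zero padded, the temporal convolution is written as an explicit linear map, and batch normalization is left out. The names `dtr_unit` and `dtr_layer` and the weight layout are hypothetical.

```python
import random

HOLE_SIZES = (1, 4, 16, 64)  # hole sizes stated in the text


def dtr_unit(features, hole, weight):
    """One unit: concatenate frames t-hole, t, t+hole and apply a linear map.

    features: list of N frame vectors, each of length D.
    weight:   D rows of length 3 * D (assumed parameter layout).
    """
    n, dim = len(features), len(features[0])
    zero = [0.0] * dim  # zero padding at the boundaries (an assumption)
    out = []
    for t in range(n):
        left = features[t - hole] if t - hole >= 0 else zero
        right = features[t + hole] if t + hole < n else zero
        concat = left + features[t] + right
        out.append([sum(w * x for w, x in zip(row, concat)) for row in weight])
    return out


def dtr_layer(features, weights):
    """Sum the outputs of the four units, then apply a ReLU."""
    dim = len(features[0])
    summed = [[0.0] * dim for _ in features]
    for hole, weight in zip(HOLE_SIZES, weights):
        for acc, vec in zip(summed, dtr_unit(features, hole, weight)):
            for i, value in enumerate(vec):
                acc[i] += value
    return [[max(0.0, x) for x in vec] for vec in summed]


# Usage: 100 frames with 4-dimensional features and random weights.
rng = random.Random(0)
frames = [[rng.random() for _ in range(4)] for _ in range(100)]
weights = [[[rng.uniform(-0.1, 0.1) for _ in range(12)] for _ in range(4)]
           for _ in HOLE_SIZES]
enhanced = dtr_layer(frames, weights)
print(len(enhanced), len(enhanced[0]))  # same number of frames and same width
```

The output keeps one vector per frame, which is the property that allows frame-level scores to be predicted afterwards.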

III-B DTR-GAN

III-B1 Generator Network

As shown in Figure 2, given the appearance features of all frames, the generator aims to produce the confidence score of each frame being a key frame and the encoded compact video feature. The whole generator architecture is composed of three modules: the temporal encoding module for learning the temporal relations among frames, the compact video representation module for generating the learned visual feature of the whole video, and the summary predictor for obtaining the final confidence score of each frame.

a) Temporal Encoding Module. The module integrates a Bidirectional LSTM (Bi-LSTM) layer [39] and a DTR network containing three DTR layers with twelve units in total, which encode both long-term temporal dependencies and multi-scale temporal relations with respect to different hole sizes.

The 2048-dimensional appearance features of all frames are taken as inputs of the module. In the first branch, they are sequentially fed into one recurrent Bi-LSTM layer. The layer consists of both a backward and a forward path, each consisting of an LSTM with 1024 hidden cells, to ensure modeling of temporal dependencies on both past and future frames. We thus obtain an updated 2048-dimensional feature vector for each frame as the concatenation of the forward and backward hidden states.

In the second branch, following the DTR layer computation of Section III-A, each DTR layer computes the enhanced feature of each frame and passes it to the next DTR layer, capturing multi-scale temporal relations among frames by making use of different hole sizes for better video representation. After passing through three DTR layers, we get the final evolved feature of each frame. Finally, the outputs of the module are two sets of updated features for all frames: the Bi-LSTM features and the DTR features.

b) Compact Video Representation Module. Given the outputs of the temporal encoding module, the encoded features of all frames are produced by merging the two sets of features. In our setting, we use a concatenation function followed by a fully connected layer to learn the merged representation for video encoding.

The outputs of this module are also used as the input of the discriminator network with three-player loss, which will be discussed later in Section III-B2.

c) Summary Predictor. To predict confidence scores for all frames as the video summary result, we introduce a summary predictor module. The score is obtained by first concatenating the two sets of features produced by the temporal encoding module, and then passing the result to a fully connected layer, a dropout layer and a batch normalization layer that outputs one value for each input frame. After that, a sigmoid non-linearity is applied to each output value to produce the summary score. In this way, the confidence scores of all frames are generated by the summary predictor as the final summary results.
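A minimal Python sketch of the scoring step of the summary predictor is given below. It assumes inference-time behavior, so dropout and batch normalization are omitted, and it treats the fully connected layer as a single weight vector with a bias; the names `summary_scores`, `lstm_feats` and `dtr_feats` are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def summary_scores(lstm_feats, dtr_feats, weight, bias=0.0):
    """Per-frame confidence score from the two temporal encodings.

    lstm_feats, dtr_feats: lists of per-frame vectors (same number of frames).
    weight: one value per entry of the concatenated vector (assumed FC layer).
    """
    scores = []
    for h, d in zip(lstm_feats, dtr_feats):
        concat = h + d  # frame-wise concatenation of both branches
        logit = sum(w * x for w, x in zip(weight, concat)) + bias
        scores.append(sigmoid(logit))
    return scores


# Usage: two frames, 2-dimensional features per branch.
print(summary_scores([[1.0, 0.0], [0.0, 1.0]],
                     [[0.5, 0.5], [0.0, 0.0]],
                     weight=[1.0, -1.0, 2.0, 0.0]))
```

Every score lies strictly between 0 and 1, so it can be read as the confidence that the frame is a key frame.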

III-B2 Discriminator Network

In order to produce a high-quality summary, it is also desirable to evaluate whether the resulting summary encodes all main video contents of the original video and also consists of as few frames as possible from a global perspective. The key requirement is to measure the video correspondence between the obtained summary and the original video.

Different from the traditional discriminator architecture [25] that only focuses on justifying the fidelity of a generated sample, the discriminator of our DTR-GAN instead learns the correspondence between input video and resulting summary, which can be treated as a paired target. Furthermore, in order to ensure that the summary is informative, we present a three-player loss. Instead of the commonly used two-player loss [40, 26], this loss further enforces the discriminator to distinguish between the learned summary and a trivial summary consisting of randomly selected frames. The whole architecture is illustrated in Figure 2.

First, the inputs for the discriminator are three duplicates of the original video feature representation, each paired with a different summary. The summaries are the ground-truth summary, the resulting summary of the generator, and a random summary, respectively. The representation of each summary is obtained based on the feature representation from the generator, allowing the discriminator to utilize the encoded temporal information.

Each of the three summaries is associated with a frame-level summary score: the ground-truth summary score, the resulting summary score produced by the generator, and the random summary score, which is sampled from a uniform distribution. The three summaries are computed by multiplying the corresponding encoded frame-level features with the respective summary scores.
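The score-weighting step can be sketched as follows. The sketch assumes that each frame-level feature vector is scaled by its scalar score and that the random scores are drawn uniformly from [0, 1); the interval and the names `weighted_summary` and `random_scores` are assumptions made for illustration.

```python
import random


def weighted_summary(encoded, scores):
    """Scale each encoded frame feature by its summary score."""
    return [[s * x for x in frame] for frame, s in zip(encoded, scores)]


def random_scores(num_frames, seed=None):
    """Random summary scores, one per frame, uniform on [0, 1) (assumed range)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(num_frames)]


# Usage: build the three summary representations for a 3-frame video.
encoded = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ground_truth = weighted_summary(encoded, [1.0, 0.0, 1.0])
generated = weighted_summary(encoded, [0.9, 0.2, 0.7])
random_summary = weighted_summary(encoded, random_scores(len(encoded), seed=0))
print(ground_truth)
```

Because all three summaries are built from the same encoded features, the discriminator sees inputs that differ only in how the frames are weighted.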


The discriminator D consists of four Bi-LSTM models, each with one layer, followed by a three-layer fully connected neural network and a sigmoid non-linearity to produce the discriminator score for the three pairs. All Bi-LSTMs have the same architecture, but some of them have different parameters.

We pass the original video feature representation to one Bi-LSTM with its own set of parameters to obtain its hidden states, and pass the three encoded summaries to the other three Bi-LSTMs, which share parameters, to obtain their hidden states. The forward and backward paths in each Bi-LSTM consist of 256 hidden units each. We thus obtain three learned representation pairs for checking the fidelity of the true representation pair and the two fake ones. We then concatenate each pair and pass it through three fully connected layers, whose dimensions are 512, 256 and 128. After that, a sigmoid layer is applied to obtain the discriminator score for each pair.

III-B3 Adversarial learning

Inspired by the objective function proposed in recent work on Wasserstein GANs [40], which has good convergence properties and alleviates the issue of mode collapse, we optimize our adversarial objective with a three-player loss via a min-max game.

Specifically, given the three learned modules of the generator and the discriminator, we jointly optimize all of them in an adversarial manner. The global objective over the real loss and the two fake losses ensures that the summaries capture enough key video representation, and avoids learning a trivial, randomly shortened sequence as the summary. The min-max adversarial learning loss combines the real loss with the two fake losses, weighted by a balancing parameter.

Substituting the summary definitions of Section III-B2 into these losses expresses the objective in terms of the encoded frame-level features and the summary scores.

We treat each player equally, since both of the two fake pairs contribute to forcing the discriminator to distinguish the compact and complete real summary from the fake ones. Thus we set the balancing parameter so that the two fake pairs, namely the generated-summary pair and the random-summary pair, are weighted equally.
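One possible reading of the three-player objective is sketched below in Python. It is not the exact equation of the paper, which is not reproduced in this text: the Wasserstein-style difference of scores, the `balance` value of 0.5 and the function name `three_player_objective` are all assumptions chosen only to reflect that the real pair is scored against two equally weighted fake pairs.

```python
def three_player_objective(d_real, d_generated, d_random, balance=0.5):
    """Wasserstein-style three-player value (illustrative assumption).

    d_real:      discriminator score for the (video, ground-truth summary) pair
    d_generated: score for the (video, generated summary) pair
    d_random:    score for the (video, random summary) pair
    balance:     weight of the random pair; 0.5 treats both fake pairs equally
    """
    fake = (1.0 - balance) * d_generated + balance * d_random
    # The discriminator tries to increase this value; the generator tries to
    # decrease it by raising the score of its own summary.
    return d_real - fake


print(three_player_objective(0.9, 0.4, 0.1))
```

With an equal weighting, neither fake pair dominates: a discriminator that only rejects random summaries, or only rejects generated ones, still leaves half of the fake term unaddressed.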

To optimize the generator, we further incorporate a supervised frame-level summarization loss between the resulting summary and the ground-truth summary during the adversarial training.


This loss aligns the generated summary with the real summary, guiding the generator to generate high-quality summaries by adding further regularization. The optimal generator is thus obtained by optimizing the adversarial objective together with this supervised loss.


Fig. 4: The inference process of the proposed DTR-GAN. The final confidence score of each frame being a key frame is obtained by passing the visual representation features to the temporal encoding module and the summary predictor, without the compact video representation module or the discriminator.

III-C Inference Process

The inference process is shown in Figure 4. Given a testing video, the proposed DTR-GAN model takes the whole video sequence as input. It then generates the confidence scores of all frames as the final summary result, using only the generator. Specifically, the testing video is first passed to the temporal encoding module, generating the learned temporal representation, which can efficiently exploit global multi-scale temporal context. Then the summary predictor is applied to get the final predicted scores for each video.

Thus, the main differences between the training and inference phases of our DTR-GAN are: 1) the discriminator is not used for inference, while the training phase relies heavily on it; 2) the compact video representation module, which learns the merged video encoding used to train the discriminator, is not required during the inference phase.

IV Experiments

IV-A Experimental Settings

Datasets. We evaluate our method on two public benchmark datasets for video summarization, i.e., SumMe [12] and TVSum [13]. The SumMe dataset contains 25 videos covering multiple events from both the first-person and the third-person view. The length of the videos ranges from 1 to 6 minutes. The TVSum dataset contains 50 videos capturing 10 categories in the TRECVid Multimedia Event Detection (MED) dataset. It contains many topics such as news, cooking and sports and the length of each video ranges from 1 to 5 minutes. Following the previous methods [5, 6], we randomly select 80% of the videos for training and 20% for testing.

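Evaluation Metrics. For fair comparison, we adopt the same keyshot-based protocol [5] as in [6, 16], i.e., the harmonic F-measure, to evaluate our method, quantifying the similarity between the generated summary and the ground-truth summary for each video. Given the generated summary A and the ground-truth summary B, the precision P and recall R of the temporal overlap are defined as:

$$P = \frac{\text{overlapped duration of } A \text{ and } B}{\text{duration of } A}, \qquad R = \frac{\text{overlapped duration of } A \text{ and } B}{\text{duration of } B},$$

and the final harmonic F-measure is computed as:

$$F = \frac{2 \times P \times R}{P + R} \times 100\%.$$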

We also follow the process of [5] to generate keyshot-level summaries from the key-frame level and the importance score-level summaries. We first apply the temporal segmentation method KTS [4] to get video segments. Then if a segment contains more than one key frame, we give all frames within that segment scores of . Afterwards, we select the generated keyshots under the constraint that the summary duration should be less than 15% of the duration of the original video by using the knapsack algorithm [13].
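The keyshot selection and evaluation steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names, the use of the mean frame score as a segment's knapsack value, and the half-open `(start, end)` segment format are all hypothetical choices, and the segments are assumed to come from an external temporal segmentation step such as KTS.

```python
from typing import List, Sequence, Tuple


def knapsack_select(values: Sequence[float], lengths: Sequence[int],
                    budget: int) -> List[int]:
    """0/1 knapsack: indices of segments maximizing total value
    while their total length stays within `budget` frames."""
    n = len(values)
    # best[i][c] = best total value using the first i segments with capacity c
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, w = values[i - 1], lengths[i - 1]
        for c in range(budget + 1):
            best[i][c] = best[i - 1][c]
            if w <= c and best[i - 1][c - w] + v > best[i][c]:
                best[i][c] = best[i - 1][c - w] + v
    # Backtrack to recover which segments were taken.
    chosen, c = [], budget
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= lengths[i - 1]
    return sorted(chosen)


def keyshot_summary(scores: Sequence[float],
                    segments: Sequence[Tuple[int, int]],
                    max_percent: int = 15) -> List[int]:
    """Turn frame-level scores into a binary keyshot summary.
    Assumption: a segment's value is its mean frame score."""
    n_frames = len(scores)
    budget = n_frames * max_percent // 100  # at most 15% of the video
    values = [sum(scores[s:e]) / (e - s) for s, e in segments]
    lengths = [e - s for s, e in segments]
    summary = [0] * n_frames
    for i in knapsack_select(values, lengths, budget):
        s, e = segments[i]
        summary[s:e] = [1] * (e - s)
    return summary


def f_measure(pred: Sequence[int], gt: Sequence[int]) -> float:
    """Harmonic F-measure (in percent) of two binary frame masks."""
    overlap = sum(p * g for p, g in zip(pred, gt))
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred)
    recall = overlap / sum(gt)
    return 100.0 * 2 * precision * recall / (precision + recall)
```

For a 20-frame toy video with segments `[(0, 5), (5, 8), (8, 20)]`, the 15% budget is 3 frames, so only the middle segment can be selected regardless of the scores.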

Implementation Details.

We implement our work using the TensorFlow framework, with one GTX TITAN X 12GB GPU on a single server. We set the learning rate to 0.0001 for the generator and 0.001 for the discriminator. During training, we update the generator twice and the discriminator once in each epoch, a schedule chosen experimentally. To form each batch, we randomly select a shot of 1000 frames that overlaps neighboring shots by 10%, which reduces the effect of edge artifacts. At test time, we feed the whole video sequence as input, which enables the model to sense temporal dependencies across the whole video.
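The shot sampling described above could look roughly like the following sketch. It assumes one particular reading of the text: the video is covered by 1000-frame shots whose neighbors share 10% of their length (a stride of 900 frames), and one shot is drawn at random per batch. The function names and the handling of the video tail are hypothetical.

```python
import random
from typing import List, Tuple


def overlapping_shots(n_frames: int, shot_len: int = 1000,
                      overlap: float = 0.10) -> List[Tuple[int, int]]:
    """Cover [0, n_frames) with half-open shots of `shot_len` frames,
    each sharing about `overlap` of its length with the next one."""
    if n_frames <= shot_len:
        return [(0, n_frames)]  # short video: a single shot
    stride = shot_len - round(shot_len * overlap)  # 900 for the defaults
    if stride <= 0:
        raise ValueError("overlap must be smaller than 1")
    starts = list(range(0, n_frames - shot_len + 1, stride))
    if starts[-1] + shot_len < n_frames:
        # Assumption: add one last shot flush with the end of the video.
        starts.append(n_frames - shot_len)
    return [(s, s + shot_len) for s in starts]


def sample_training_shot(n_frames: int, rng: random.Random) -> Tuple[int, int]:
    """Draw one shot uniformly at random to form a training batch."""
    return rng.choice(overlapping_shots(n_frames))


# Update order within one epoch as stated in the text:
# the generator is trained twice, the discriminator once.
UPDATES_PER_EPOCH = ("generator", "generator", "discriminator")
```

For a 2800-frame video this yields the shots (0, 1000), (900, 1900) and (1800, 2800). At test time no such windowing is applied, since the whole sequence is fed to the generator.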

IV-B Comparison with the state-of-the-art methods

We compare our DTR-GAN with several supervised state-of-the-art methods to illustrate the advantages of our algorithm. Table I shows the quantitative results on the SumMe and TVSum datasets. Our DTR-GAN outperforms the other supervised state-of-the-art methods on both datasets. In particular, DTR-GAN achieves a 2.5% higher F-measure than the state-of-the-art method [16] on SumMe (44.6% vs. 42.1%), and a 1.0% higher F-measure on TVSum (59.1% vs. 58.1%). These improvements indicate the strength of DTR-GAN in encoding long-term temporal dependencies and correlations for determining the importance of each frame. They also illustrate the effectiveness of validating information completeness and summary compactness from a global perspective with our three-player adversarial training approach.

| Method | SumMe [11] | TVSum [13] |
|---|---|---|
| Interestingness [12] | 39.3 | - |
| Submodularity [11] | 39.7 | - |
| Summary transfer [20] | 40.9 | - |
| DPP-LSTM [5] | 38.6 | 54.7 |
| [6] | 41.7 | 56.3 |
| [16] | 42.1 | 58.1 |
| DTR-GAN | 44.6 | 59.1 |

TABLE I: Comparison results obtained by our method and other supervised approaches on SumMe [11] and TVSum [13] datasets in terms of harmonic F-measure.

From Table I, we can observe that our DTR-GAN outperforms the DPP-LSTM work [5] by 6.0% and 4.4% in F-measure on SumMe and TVSum, respectively. In [5], the DPP-LSTM model contains two LSTM layers, one modeling the video sequence in the forward direction and the other in the backward direction. The authors also combine the LSTM layers’ hidden states and the input visual features with a multi-layer perceptron, and add a determinantal point process for enhancement. From the experimental results, we conclude that combining Bi-LSTM and DTR units in DTR-GAN captures global multi-scale temporal relations better and therefore achieves better results.

Note that another recent work [6] also applies adversarial networks to temporal features produced by LSTMs for video summarization. However, our DTR-GAN differs from it in three ways: 1) the generator in [6] encodes all information into one fixed-length representation, which may reduce the model's learning capability for video sequences of different lengths; 2) our DTR-GAN introduces a new three-player loss to prevent the network from selecting trivial random short sequences as results; 3) in the generator network, besides the traditional LSTM, we incorporate a new DTR unit to facilitate temporal relation encoding by further exploiting multi-scale local dependencies.

The most recent state-of-the-art work [16] achieves the best video summarization results among the existing methods. The authors train a deep summarization network based on LSTMs via reinforcement learning, with a reward function that jointly accounts for diversity and representativeness. Our method achieves 2.5% and 1.0% higher F-measure than [16] on SumMe and TVSum, respectively. We attribute this to the additional regularization imposed on the generator, which helps it obtain better summaries, and to the better temporal modeling obtained by combining Bi-LSTM and DTR units.

| Method | DTR units | Bi-LSTM | G_gt_loss | Discriminator | SumMe [11] |
|---|---|---|---|---|---|
| DTR-GAN | holes (1,4,16,64) | ✓ | ✓ | three-player loss | 44.6 |
| DTR-GAN_(holes 1,2,4,16) | holes (1,2,4,16) | ✓ | ✓ | three-player loss | 41.4 |
| DTR-GAN_(holes 16,32,64,128) | holes (16,32,64,128) | ✓ | ✓ | three-player loss | 42.6 |
| DTR-GAN w/o Bi-LSTM in G | holes (1,4,16,64) | - | ✓ | three-player loss | 43.7 |
| DTR-GAN w/o DTR units in G | - | ✓ | ✓ | three-player loss | 41.7 |
| DTR-GAN w/o rand [40] | holes (1,4,16,64) | ✓ | ✓ | two-player loss | 40.6 |
| DTR-GAN_least square loss [42] | holes (1,4,16,64) | ✓ | ✓ | least square loss | 42.9 |
| DTR-GAN w G only | holes (1,4,16,64) | ✓ | ✓ | - | 40.8 |
| DTR-GAN w/o G_gt_loss | holes (1,4,16,64) | ✓ | - | three-player loss | 41.9 |

TABLE II: Comparison of results for our ablation experiments, indicating the importance of the various components in our model for the SumMe [11] dataset in terms of harmonic F-measure. (Text in blue highlights the components that differ from the original DTR-GAN.)

IV-C Ablation Analysis

We conduct extensive ablation studies to validate the effectiveness of different components in our model by experimenting with different model variants. The ablation analyses and the model component combinations are as follows:

IV-C1 Ablation Models

Comparisons of DTR units
  • DTR-GAN_(holes 1,2,4,16). DTR units with hole size of (1,2,4,16) for each layer in order to compare the proposed hole size of (1,4,16,64) with this variant that uses a smaller range of temporal modeling.

  • DTR-GAN_(holes 16,32,64,128). DTR units with hole size of (16,32,64,128) for each layer in order to compare the proposed hole size of (1,4,16,64) with this variant that uses a larger range of temporal modeling.

Comparisons of Each Temporal Encoding Module
  • DTR-GAN w/o Bi-LSTM in G. Drop the Bi-LSTM model in the generator in DTR-GAN to analyze the effect of the Bi-LSTM network in the proposed model.

  • DTR-GAN w/o DTR units in G. Drop the DTR network in the generator in DTR-GAN to analyze the effect of the DTR units in the proposed model.

Comparisons of Discriminator
  • DTR-GAN w/o rand. Apply a two-player loss by dropping the random summary loss in the discriminator, to analyze the effect of the three-player loss in the proposed model compared with the commonly used two-player loss structure.

  • DTR-GAN_least square loss. Apply the Least Squares loss function [42] instead of the loss designed in Wasserstein GAN [40] to analyze the effect of loss functions in the proposed DTR-GAN.

Comparison of Adversarial Learning
  • DTR-GAN w G only. Drop the discriminator part with the adversarial training, and use only the generator in order to analyze the effect of adversarial learning in the proposed model.

Comparison of Supervised Loss
  • DTR-GAN w/o G_gt_loss. Drop the ground-truth loss in the generator to analyze the effect of the supervised loss for generating summaries with human annotated labels.

IV-C2 Ablation Discussion

In Table II, we list the settings of the DTR units, Bi-LSTM, G_gt_loss and Discriminator components for each model. As shown in the second row, the proposed DTR-GAN uses hole sizes of 1, 4, 16 and 64 for the DTR units in each DTR layer, includes both the Bi-LSTM and G_gt_loss, and applies the three-player loss discriminator. The remaining rows show the model variants used in the ablation discussion, where text in blue marks the components that differ from the proposed DTR-GAN model.

The Effect of DTR Units

Since our DTR units employ multi-scale hole sizes to capture longer-range temporal dependencies, it is interesting to explore how the choice of hole sizes affects summarization performance. We tested two additional hole size settings, (1,2,4,16) and (16,32,64,128), whereas the proposed setting used in all other experiments is (1,4,16,64). The (1,2,4,16) model covers a smaller range of temporal dilation and the (16,32,64,128) model a larger one, whereas the proposed DTR-GAN spans a wider variety of hole sizes and thus captures multi-scale temporal relations better.

From Table II, we can observe that both variants are inferior to the 44.6% of DTR-GAN. Moreover, the performance difference between the larger and smaller hole sizes is minor (42.6% vs. 41.4%).

The above comparison results indicate that with larger hole size we can obtain better results due to the larger time span. On the other hand, small holes are also required because of the fact that neighboring frames tend to share more similar features and have to some extent more temporal dependencies.
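To make the notion of hole size concrete, the following sketch applies a one-dimensional dilated (atrous) convolution along the time axis and reports the temporal span of a single filter for each hole size. It is a simplified illustration, not the DTR unit itself: the kernel width of 3, the scalar per-frame features, the zero padding, and the function names are all assumptions introduced here, and how the DTR unit merges the responses of its different hole sizes is not shown.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float],
                   hole: int) -> List[float]:
    """Zero-padded 1-D dilated convolution over time (cross-correlation
    form, as in most deep learning frameworks). Kernel taps are spaced
    `hole` frames apart, so the output keeps the input length."""
    center = len(kernel) // 2
    out = []
    for t in range(len(x)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = t + (j - center) * hole
            if 0 <= idx < len(x):  # frames outside the video count as zero
                acc += w * x[idx]
        out.append(acc)
    return out


def temporal_span(kernel_size: int, hole: int) -> int:
    """Number of frames between the first and last tap, inclusive."""
    return (kernel_size - 1) * hole + 1


def multi_scale_responses(x: Sequence[float], kernel: Sequence[float],
                          holes: Sequence[int]) -> List[List[float]]:
    """One response sequence per hole size, all aligned with the input."""
    return [dilated_conv1d(x, kernel, h) for h in holes]


if __name__ == "__main__":
    for holes in [(1, 4, 16, 64), (1, 2, 4, 16), (16, 32, 64, 128)]:
        print(holes, [temporal_span(3, h) for h in holes])
```

With the assumed kernel width of 3, the proposed setting (1,4,16,64) spans 3, 9, 33 and 129 frames, the smaller setting (1,2,4,16) spans 3, 5, 9 and 33 frames, and the larger setting (16,32,64,128) spans 33, 65, 129 and 257 frames. The proposed setting is the only one of the three that covers both adjacent frames and spans beyond one hundred frames.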

The Effect of Each Temporal Encoding Module

Note that our generator network incorporates both the long-term LSTM units and the multi-scale DTR units. Comparing the full DTR-GAN with variants that lack either the Bi-LSTM or the DTR units shows the effect of each module on the final summarization performance. Performance decreases when either the Bi-LSTM or the DTR units are removed.

From Table II, removing the Bi-LSTM module decreases the performance of our approach by 0.9%, while removing the DTR units decreases it by 2.9%. This shows that the DTR units contribute more than the Bi-LSTM module: better multi-scale temporal dependency modeling helps learn a better temporal representation of the video, resulting in more compact and complete summaries. The Bi-LSTM nevertheless improves the performance of the whole model, so we combine the two modules for better video summarization.

The Effect of the Discriminator

We also test a model variant that uses only the standard two-player loss, i.e., pairs of the original video with the ground-truth summary and with the generated summary. This validates the effectiveness of our proposed three-player objective in adversarial training. The variant is likewise based on the Wasserstein GAN structure [40]. There is a large performance difference between the standard two-player loss and our proposed three-player loss (40.6% vs. 44.6% in Table II). The reason is that the random summary provides additional regularization, which ensures that the video representations are not learned from a trivial, randomly shortened sequence.

Moreover, we replace the Wasserstein GAN with the Least Squares GAN [42] structure while keeping our proposed three-player loss. From Table II, the performance of this variant is 42.9%, which is still 0.8% better than the previous state-of-the-art work [16]. This further demonstrates that our proposed approach does not rely on a particular GAN structure.
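The difference between the two-player and three-player objectives can be illustrated with a small sketch. The exact objective is defined earlier in the paper and is not reproduced in this section, so the form below is only one plausible Wasserstein-style reading: the critic is pushed to score ground-truth summaries high and both generated and random summaries low. The equal 0.5 weighting of the two negative terms and all function names are assumptions made here for illustration.

```python
from typing import Sequence


def mean(xs: Sequence[float]) -> float:
    return sum(xs) / len(xs)


def critic_loss_two_player(d_real: Sequence[float],
                           d_gen: Sequence[float]) -> float:
    """Standard Wasserstein critic loss: real summaries vs. generated ones."""
    return -mean(d_real) + mean(d_gen)


def critic_loss_three_player(d_real: Sequence[float],
                             d_gen: Sequence[float],
                             d_rand: Sequence[float]) -> float:
    """Wasserstein-style critic loss with a third, random-summary term.
    Assumption: generated and random summaries are weighted equally."""
    return -mean(d_real) + 0.5 * (mean(d_gen) + mean(d_rand))


def generator_adversarial_loss(d_gen: Sequence[float]) -> float:
    """The generator tries to raise the critic's score of its summaries."""
    return -mean(d_gen)


if __name__ == "__main__":
    d_real, d_gen, d_rand = [1.0, 3.0], [0.0, 1.0], [-1.0, -1.0]
    print(critic_loss_two_player(d_real, d_gen))
    print(critic_loss_three_player(d_real, d_gen, d_rand))
```

In this reading, the extra term gives the critic explicit negative examples of randomly shortened sequences, which is the regularization effect discussed above. A least-squares variant would replace the linear score terms with squared distances to target labels.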

The Effect of the Adversarial Learning Module

In addition, we also trained the model only using the generator. The performance of this baseline is only 40.8%, which is lower than most other ablation models and is 3.8% lower than the proposed DTR-GAN architecture. This demonstrates that the adversarial training with discriminator works better than non-adversarial training.

The discriminator functions to discriminate the ground-truth summary from generated and random summaries, which helps to enforce that the generator generates more complete and compact summaries.

The Effect of the Supervised Loss

During the adversarial training, we introduce the ground-truth loss for the generator as a form of regularization, by aligning the generated frame-level importance scores with the ground-truth scores.

From Table II, we can see that the model obtains better performance on frame-level video summarization with the supervised loss. Specifically, removing the “G_gt_loss” component lowers the F-measure from 44.6% to 41.9%, a drop of 2.7%. This illustrates that our model learns much better when using the human-annotated labels.

IV-D Qualitative Results

To better demonstrate some key components of our framework, we visualize examples of the summary results overlaid on the ground-truth frame-level importance scores in Figure 5 and Figure 6. The summaries consist of the key frames selected according to the importance scores produced by the generator.

Figure 5 shows the results for the video Statue of Liberty in the SumMe dataset for “DTR-GAN”, “DTR-GAN w/o rand”, “DTR-GAN w/o G_gt_loss” and “DTR-GAN w G only”. Figure 6 shows the results for the video Bus in Rock Tunnel in the SumMe dataset for “DTR-GAN”, “DTR-GAN_(holes 1,2,4,16)”, “DTR-GAN_(holes 16,32,64,128)” and “DTR-GAN w/o DTR units in G”.

From these figures, we can see that the visualized results agree with the quantitative results in Table II: the full model produces better video summaries than the other three variants in each figure. All of the key components of our proposed framework contribute to improving overall performance.

Fig. 5: Video summarization results of some variants of our proposed DTR-GAN method for the video Statue of Liberty in SumMe [11]. The dark blue bars in b), c), d), e) are the ground-truth frame-level scores, and the colored segments are the summary results generated by different model variants.

Fig. 6: Video summarization results of some variants of our proposed DTR-GAN method for the video Bus in Rock Tunnel in SumMe [11]. The dark blue bars in b), c), d), e) are the ground-truth frame-level scores, and the colored segments are the summary results generated by different model variants.

V Conclusion

In this paper, we proposed DTR-GAN for frame-level video summarization. It consists of a DTR generator and a discriminator with three-player loss and is trained in an adversarial manner. Specifically, the generator combines two temporal dependency learning modules, Bi-LSTM and our proposed DTR network with three layers of four different hole sizes in each layer for multi-scale global temporal learning. In the discriminator, we use a three-player loss, which contains the generated summary, random summary, and ground-truth to introduce more restrictions during adversarial training. This helps the generator to generate more complete and compact summaries. Experiments on two public datasets SumMe and TVSum demonstrate the effectiveness of our proposed framework. For our future work, we plan to apply this framework to more general video summarization tasks like query-based video summarization [43, 44], to allow the generation of personalized summaries for individual users.


This project is supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. This work is also partially funded by the Norwegian Research Council FRIPRO grant no. 239844 on developing the Next Generation Learning Machines.


  • [1] Z. Bin and X. E. P., “Quasi real-time summarization for consumer videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014.
  • [2] G. B. Sharghi Aidean and S. Mubarak, “Query-focused extractive video summarization,” in Proceedings of European conference on computer vision, 2016.
  • [3] J. Meng, H. Wang, J. Yuan, and Y.-P. Tan, “From keyframes to key objects: Video summarization by representative object proposal selection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 1039–1048.
  • [4] D. Potapov, M. Douze, Z. Harchaoui, and C. Schmid, “Category-specific video summarization,” in Proceedings of European conference on computer vision.   Springer, 2014, pp. 540–555.
  • [5] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Video summarization with long short-term memory,” in Proceedings of European conference on computer vision.   Springer, 2016, pp. 766–782.
  • [6] B. Mahasseni, M. Lam, and S. Todorovic, “Unsupervised video summarization with adversarial lstm networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, July 2017.
  • [7] B. A. Plummer, M. Brown, and S. Lazebnik, “Enhancing video summarization via vision-language embedding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
  • [8] B. Gong, W.-L. Chao, K. Grauman, and F. Sha, “Diverse sequential subset selection for supervised video summarization,” in Advances in Neural Information Processing Systems, 2014, pp. 2069–2077.
  • [9] J. Xu, L. Mukherjee, Y. Li, J. Warner, J. M. Rehg, and V. Singh, “Gaze-enabled egocentric video summarization via constrained submodular maximization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 2235–2244.
  • [10] G. Kim, L. Sigal, and E. P. Xing, “Joint summarization of large-scale collections of web images and videos for storyline reconstruction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 4225–4232.
  • [11] M. Gygli, H. Grabner, and L. Van Gool, “Video summarization by learning submodular mixtures of objectives,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3090–3098.
  • [12] M. Gygli, H. Grabner, H. Riemenschneider, and L. Van Gool, “Creating summaries from user videos,” in Proceedings of European conference on computer vision.   Springer, 2014, pp. 505–520.
  • [13] Y. Song, J. Vallmitjana, A. Stent, and A. Jaimes, “Tvsum: Summarizing web videos using titles,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5179–5187.
  • [14] B. Zhao and E. P. Xing, “Quasi real-time summarization for consumer videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2513–2520.
  • [15] S. Zhang, Y. Zhu, and A. K. Roy-Chowdhury, “Context-aware surveillance video summarization,” IEEE Transactions on Image Processing, vol. 25, no. 11, pp. 5469–5478, 2016.
  • [16] K. Zhou, Y. Qiao, and T. Xiang, “Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward,” arXiv preprint arXiv:1801.00054, 2017.
  • [17] Z. Ji, K. Xiong, Y. Pang, and X. Li, “Video summarization with attention-based encoder-decoder networks,” arXiv preprint arXiv:1708.09545, 2017.
  • [18] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
  • [19] A. Kulesza, B. Taskar et al., “Determinantal point processes for machine learning,” Foundations and Trends® in Machine Learning, vol. 5, no. 2–3, pp. 123–286, 2012.
  • [20] K. Zhang, W.-L. Chao, F. Sha, and K. Grauman, “Summary transfer: Exemplar-based subset selection for video summarization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1059–1067.
  • [21] T. Yao, T. Mei, and Y. Rui, “Highlight detection with pairwise deep ranking for first-person video summarization,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 982–990.
  • [22] K. ZAtsushi, V. G. Luc, U. Yoshitaka, and H. Tatsuya, “Viewpoint-aware video summarization,” arXiv preprint arXiv:1804.02843v2, 2018.
  • [23] W.-S. Chu, Y. Song, and A. Jaimes, “Video co-summarization: Video summarization by visual co-occurrence,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
  • [24] R. Panda and A. K. Roy-Chowdhury, “Collaborative summarization of topic-related videos,” in CVPR, vol. 2, no. 4, 2017, p. 5.
  • [25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in neural information processing systems, 2014, pp. 2672–2680.
  • [26] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” arXiv preprint arXiv:1703.10593, 2017.
  • [27] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee, “Generative adversarial text to image synthesis,” arXiv preprint arXiv:1605.05396, 2016.
  • [28] A. Ghosh, V. Kulharia, A. Mukerjee, V. Namboodiri, and M. Bansal, “Contextual rnn-gans for abstract reasoning diagram generation,” arXiv preprint arXiv:1609.09444, 2016.
  • [29] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
  • [30] M. F. Mathieu, J. J. Zhao, J. Zhao, A. Ramesh, P. Sprechmann, and Y. LeCun, “Disentangling factors of variation in deep representation using adversarial training,” in Advances in Neural Information Processing Systems, 2016, pp. 5040–5048.
  • [31] A. Radford, L. Metz, and S. Chintala, “Unsupervised representation learning with deep convolutional generative adversarial networks,” arXiv preprint arXiv:1511.06434, 2015.
  • [32] X. Liang, L. Lee, W. Dai, and E. P. Xing, “Dual motion gan for future-flow embedded video prediction,” arXiv preprint, 2017.
  • [33] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, “Rethinking atrous convolution for semantic image segmentation,” arXiv preprint arXiv:1706.05587, 2017.
  • [34] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks for action segmentation and detection,” arXiv preprint arXiv:1611.05267, 2016.
  • [35] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [37] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3d convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4489–4497.
  • [38] C. Feichtenhofer, A. Pinz, and R. Wildes, “Spatiotemporal residual networks for video action recognition,” in Advances in neural information processing systems, 2016, pp. 3468–3476.
  • [39] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural Networks, vol. 18, no. 5-6, pp. 602–610, 2005.
  • [40] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial nets,” in Proceedings of International Conference on Machine Learning, 2017.
  • [41] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin et al., “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from [Online]. Available:
  • [42] X. Mao, Q. Li, H. Xie, R. Y. Lau, Z. Wang, and S. P. Smolley, “Least squares generative adversarial networks,” in 2017 IEEE International Conference on Computer Vision (ICCV).   IEEE, 2017, pp. 2813–2821.
  • [43] A. Sharghi, J. S. Laurel, and B. Gong, “Query-focused video summarization: Dataset, evaluation, and a memory network based approach,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2017, pp. 2127–2136.
  • [44] A. B. Vasudevan, M. Gygli, A. Volokitin, and L. Van Gool, “Query-adaptive video summarization via quality-aware relevance estimation,” in Proceedings of the 2017 ACM on Multimedia Conference.   ACM, 2017, pp. 582–590.