Improving Sequential Determinantal Point Processes for Supervised Video Summarization

07/28/2018 ∙ by Aidean Sharghi, et al. ∙ MIT Microsoft University of Central Florida The University of Iowa 4

It is now much easier than ever before to produce videos. While the ubiquitous video data is a great source for information discovery and extraction, the computational challenges are unparalleled. Automatically summarizing the videos has become a substantial need for browsing, searching, and indexing visual content. This paper is in the vein of supervised video summarization using sequential determinantal point process (SeqDPP), which models diversity by a probabilistic distribution. We improve this model in two ways. In terms of learning, we propose a large-margin algorithm to address the exposure bias problem in SeqDPP. In terms of modeling, we design a new probabilistic distribution such that, when it is integrated into SeqDPP, the resulting model accepts user input about the expected length of the summary. Moreover, we also significantly extend a popular video summarization dataset with 1) more egocentric videos, 2) dense user annotations, and 3) a refined evaluation scheme. We conduct extensive experiments on this dataset (about 60 hours of videos in total) and compare our approach to several competitive baselines.




1 Introduction

It is now much easier than ever before to produce videos due to ubiquitous acquisition capabilities. The videos captured by UAVs and drones, from ground surveillance, and by body-worn cameras are easily reaching the scale of gigabytes per day. In 2017, it was estimated that there were at least 2.32 billion active camera phones in the world [1]. In 2015, 2.4 million GoPro body cameras were sold worldwide [2]. While the big video data is a great source for information discovery and extraction, the computational challenges are unparalleled. Automatically summarizing the videos has become a substantial need for browsing, searching, and indexing visual content.

Under the extractive video summarization framework, a summary is composed of important shots of the underlying video. This notion of importance, however, varies drastically from work to work in the literature. Wolf defines the importance as a function of motion cues [3]. Zhao and Xing formulate it by reconstruction errors [4]. Gygli et al. learn a mixture of interestingness, representativeness, and uniformity measures to find what is important [5]. These differences highlight the complexity of video summarization. The criteria for summarizing vastly depend on the content, styles, lengths, etc. of the video and, perhaps more importantly, users’ preferences. For instance, to summarize a surveillance video, a running action might flag an important event whereas in a football match it can be a normal action observed throughout the video.

To overcome those challenges, there are two broad categories of approaches in general. One is to reduce the problem domain to a homogeneous set of videos which share about the same characteristics (e.g., length and style) so that experts can engineer some domain-specific criteria of good summaries [6, 7]. The other is to design models that can learn the criteria automatically, often from human-annotated summaries in a supervised manner [8, 9, 10, 11, 12]. The latter is more appealing because a model can be trained for different settings of choice, while the former is not as scalable.

This paper is also in the vein of supervised video summarization based on determinantal point process (DPP) [13]. Arising from quantum physics and random matrix theories, DPP is a powerful tool to balance importance and diversity, two axiomatic properties in extractive video summarization. Indeed, a good summary must be collectively diverse in the sense that it should not have redundancy of information. Moreover, a shot selected into the summary must add value to the quality of the summary; otherwise, it is not important in the context of the summary. Thanks to the versatility of DPP and one of its extensions called SeqDPP [8] for handling sequences, they have been employed in a rich line of recent works on video summarization [9, 10].

This paper makes a two-pronged contribution towards improving these models so that they learn better video summarizers more effectively. In terms of learning, we propose a large-margin algorithm to address SeqDPP's exposure bias problem explained below. In terms of modeling, we design a new probabilistic block such that, when it is integrated into SeqDPP, the resulting model accepts user input about the expected length of the summary.

We first explain the exposure bias problem with the existing SeqDPP works. It is actually a mismatch issue shared by many sequence-to-sequence (seq2seq) learning methods [14, 15, 16, 17, 18]. When the model is trained by maximizing the likelihood of user annotations, the model takes as input user-annotated "oracle" summaries. At the test time, however, the model generates output by searching over the output space in a greedy fashion, and its intermediate conditional distributions may receive input from the previous time step that deviates from the oracle. In other words, the model is exposed to different environments in the training and testing stages, respectively. This exposure bias also results in the loss-evaluation mismatch [19] between the training phase and the inference. To tackle these issues, we adapt the large-margin algorithm originally derived for training LSTMs [20] to the SeqDPPs. The main idea is to alleviate the exposure bias by incorporating inference techniques of the test time into the objective function used for training. Meanwhile, we add to the large-margin formulation a multiplicative reward term that is directly related to the evaluation metrics to mitigate the loss-evaluation mismatch.

In addition to the new large-margin learning algorithm, we also improve the SeqDPP model by a novel probabilistic distribution in order to allow users to control the lengths of system-generated video summaries. To this end, we propose a generalized DPP (GDPP) in which an arbitrary prior distribution can be imposed over the sizes of subsets of video shots. As a result, both vanilla DPP and $k$-DPP [21] can be considered as special instances of GDPP. Moreover, we can conveniently substitute the (conditional) DPPs in SeqDPP by GDPP. When a user gives an expected length of the summary, we dynamically allocate it to different segments of the video and then choose the right numbers of video shots from corresponding segments.

We conduct extensive experiments to verify the improved techniques for supervised video summarization. First of all, we significantly extend the UTE dataset [22] and its annotations of video summaries and per-shot concepts [10] by another eight egocentric videos [23]. Following the protocol described in [10], we collect three user summaries for each of the hours-long videos as well as concept annotations for each video shot. We evaluate the large-margin learning algorithm on not only the proposed sequential GDPP but also the existing SeqDPP models.

2 Related work and background

We briefly review the related work in this section. Besides, we also describe the major body of DPPs and SeqDPPs. Readers are referred to [13] and [8] for more details and properties of the two versatile probability models.

Supervised video summarization.

In recent years, data-driven approaches to research problems have attracted plenty of attention. This is mainly because they can learn complex relations from data, especially when the underlying relations are unknown. Video summarization is an instance of such cases. The fact that different users prefer different summaries is strong evidence of the complexity of the problem. To overcome the impediments, one solution is to learn how to make good summaries in a supervised manner. The degree of supervision, however, differs in the literature. In [24, 25, 26, 27], weakly supervised web image and video priors help define visual importance, while captions associated with videos are used by [28, 29] to infer semantic importance. Finally, many frameworks (e.g., [11, 8, 9, 10, 5]) learn a summarizer directly from user-annotated summaries.

Sequence-to-Sequence Learning.

Sequence-to-sequence (Seq2seq) modeling has been successfully employed in a vast set of applications, especially in Natural Language Processing (NLP). By the use of Recurrent Neural Networks (RNNs), impressive modeling capabilities and results are achieved in various fields such as machine translation and text generation applications (e.g., for image and video captioning [30, 31]).

Seq2seq models are conveniently trained as conditional language models, maximizing the probability of observing the next ground-truth word conditioned on the input and the preceding target words. This translates to using merely a word-level loss (usually a simple cross-entropy over the vocabulary).

While the training procedure described above has been shown to be effective in various word-generation tasks, the learned models are not used as conditional models during inference at test time. Conventionally, a greedy approach is taken to generate the output sequence. Moreover, when evaluating, the complete output sequence is compared against the gold target sequence using a sequence-level evaluation metric such as ROUGE [32] and BLEU [33].

Determinantal point process (DPP).

A discrete DPP [13, 34] defines a distribution over all the subsets of a ground set measuring the negative correlation, or repulsion, of the elements in each subset. Given a ground set $\mathcal{Y} = \{1, \dots, N\}$, one can define $K \in \mathbb{R}^{N \times N}$, a positive semi-definite kernel matrix that represents the per-element importance as well as the pairwise similarities between the $N$ elements. A distribution over a random subset $Y \subseteq \mathcal{Y}$ is a DPP, if for every $y \subseteq \mathcal{Y}$ the following holds:

$$P(y \subseteq Y; K) = \det(K_y), \tag{1}$$

where $K_y$ is the square sub-kernel of $K$ with rows and columns indexed by the elements in $y$, and $\det(\cdot)$ is the determinant function. $K$ is referred to as the marginal kernel since one can compute the probability of any subset $y$ being included in $Y$. It is the property of the determinant that promotes diversity: in order to have a high probability $P(i, j \in Y; K) = K_{ii} K_{jj} - K_{ij}^2$, the per-element importance terms $K_{ii}$ and $K_{jj}$ must be high and meanwhile the pairwise similarity terms $K_{ij}$ must be low.
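The diversity effect of the determinant can be checked numerically. The following is a minimal sketch, assuming a small hand-made marginal kernel `K` (its values are illustrative and not taken from the paper) in which items 0 and 1 are near-duplicates and item 2 is unrelated to both.

```python
def pair_inclusion_probability(K, i, j):
    """P(i and j are both selected) = det of the 2x2 sub-kernel of K."""
    return K[i][i] * K[j][j] - K[i][j] * K[j][i]


# Hypothetical marginal kernel: equal importance (0.5) for every item,
# high similarity between items 0 and 1, none with item 2.
K = [
    [0.50, 0.45, 0.00],
    [0.45, 0.50, 0.00],
    [0.00, 0.00, 0.50],
]

similar_pair = pair_inclusion_probability(K, 0, 1)
diverse_pair = pair_inclusion_probability(K, 0, 2)
print(f"similar pair: {similar_pair:.4f}, diverse pair: {diverse_pair:.4f}")
```

With equal per-item importance, the pair of dissimilar items is roughly five times more likely to be selected together than the near-duplicate pair.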

To directly specify the atomic probabilities for all the subsets of $\mathcal{Y}$, Borodin and Rains derived another form of DPPs through a positive semi-definite matrix $L = K(I - K)^{-1}$ [35], where $I$ is an identity matrix. It samples a subset $y \subseteq \mathcal{Y}$ with probability

$$P_L(Y = y; L) = \frac{\det(L_y)}{\det(L + I)}, \tag{2}$$

where the denominator $\det(L + I)$ is a normalization constant.
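As a sanity check on the L-ensemble form, the sketch below enumerates every subset of a three-item ground set and confirms that the probabilities sum to one. The kernel `L` and the helper names (`det`, `submatrix`, `l_ensemble_distribution`) are illustrative assumptions; enumeration is only feasible for tiny ground sets.

```python
from itertools import combinations


def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d  # the determinant of the empty matrix is 1


def submatrix(L, idx):
    return [[L[i][j] for j in idx] for i in idx]


def l_ensemble_distribution(L):
    """Return {subset: probability} with P(y) = det(L_y) / det(L + I)."""
    n = len(L)
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    Z = det(L_plus_I)
    return {y: det(submatrix(L, y)) / Z
            for k in range(n + 1) for y in combinations(range(n), k)}


# Hypothetical kernel: items 0 and 1 are similar, item 2 is distinct.
L = [[2.0, 1.8, 0.0],
     [1.8, 2.0, 0.0],
     [0.0, 0.0, 2.0]]
probs = l_ensemble_distribution(L)
print(sum(probs.values()), probs[(0, 1)], probs[(0, 2)])
```

The diverse pair `(0, 2)` receives a much larger probability than the redundant pair `(0, 1)`, even though all three items have the same diagonal entry.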

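Sequential DPP (SeqDPP).

Gong et al. proposed SeqDPP [36] to preserve partial orders of the elements in the ground set. Given a long sequence $\mathcal{V}$ of elements (e.g., video shots), we divide them into $T$ disjoint yet consecutive partitions $\bigcup_{t=1}^{T} \mathcal{V}_t = \mathcal{V}$. The elements within each partition are orderless so that a DPP can be applied, and yet the orders among the partitions are observed in the following manner. At the $t$-th time step, SeqDPP selects a diverse subset of elements by a variable $Y_t \subseteq \mathcal{V}_t$ from the corresponding partition and conditioned on the elements $y_{t-1} \subseteq \mathcal{V}_{t-1}$ selected from the previous partition. In particular, the distribution of the subset selection variable $Y_t$ is given by a conditional DPP,

$$P(Y_t = y_t \mid Y_{t-1} = y_{t-1}) \triangleq P_L(Y = y_t \cup y_{t-1} \mid y_{t-1} \subseteq Y; L^t) \tag{3}$$

$$= P_L(Y_t = y_t; \Omega^t) = \frac{\det(\Omega^t_{y_t})}{\det(\Omega^t + I)}, \tag{4}$$

where $P_L(Y; L^t)$ and $P_L(Y_t; \Omega^t)$ are two L-ensemble DPPs with the ground sets $y_{t-1} \cup \mathcal{V}_t$ and $\mathcal{V}_t$, respectively; namely, the conditional DPP itself is a valid DPP over the reduced ground set. The relationship between the two L-ensemble kernels $L^t$ and $\Omega^t$ is given by [35],

$$\Omega^t = \left( \left[ (L^t + I_{\mathcal{V}_t})^{-1} \right]_{\mathcal{V}_t} \right)^{-1} - I, \tag{5}$$

where $I_{\mathcal{V}_t}$ is an identity matrix of the same size as $L^t$ except that the diagonal entries corresponding to $y_{t-1}$ are 0's, $[\cdot]_{\mathcal{V}_t}$ is the square submatrix of $[\cdot]$ indexed by the elements in $\mathcal{V}_t$, and the number of rows/columns of the last identity matrix $I$ equals the size of the $t$-th video segment $\mathcal{V}_t$.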
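To make the conditioning concrete, here is a small sketch that computes the conditional distribution by brute force: each candidate subset of the current segment is weighted by the determinant of the kernel restricted to that subset together with the previously selected items, and the weights are then normalized. This avoids forming the conditional kernel explicitly and should agree with it on small problems. The kernel values and function names are assumptions made for illustration.

```python
from itertools import combinations


def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def submatrix(L, idx):
    return [[L[i][j] for j in idx] for i in idx]


def conditional_distribution(L, prev_selected, segment):
    """P(Y_t = y | previous selection) for every y in the segment, by enumeration."""
    weights = {}
    for k in range(len(segment) + 1):
        for y in combinations(segment, k):
            weights[y] = det(submatrix(L, list(prev_selected) + list(y)))
    total = sum(weights.values())
    return {y: w / total for y, w in weights.items()}


# Hypothetical kernel over {0} (already selected) and the current segment {1, 2}.
# Shot 1 is nearly a repeat of shot 0; shot 2 shows new content.
L = [[2.0, 1.8, 0.0],
     [1.8, 2.0, 0.0],
     [0.0, 0.0, 2.0]]
cond = conditional_distribution(L, prev_selected=(0,), segment=(1, 2))
print(cond)
```

Because shot 1 repeats what was already selected, the conditional distribution favors picking shot 2 alone.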

3 A large-margin algorithm for learning SeqDPPs

We present the main large-margin learning algorithm in this section. We first review the mismatch between the training and inference of SeqDPPs [8] and then describe the large-margin algorithm in detail.

Training and inference of SeqDPP.

For the application of supervised video summarization, SeqDPP is trained by maximizing the likelihood (MLE) of user summaries. At the test time, however, an approximate online inference is employed:

$$\hat{y}_1 = \arg\max_{y \subseteq \mathcal{V}_1} P(Y_1 = y), \quad \hat{y}_2 = \arg\max_{y \subseteq \mathcal{V}_2} P(Y_2 = y \mid Y_1 = \hat{y}_1), \quad \dots, \quad \hat{y}_t = \arg\max_{y \subseteq \mathcal{V}_t} P(Y_t = y \mid Y_{t-1} = \hat{y}_{t-1}), \quad \dots \tag{6}$$

We note that, in the inference phase, a possible error at one time step (e.g., $\hat{y}_1$) propagates to the future, but MLE always feeds the oracle summary to SeqDPP in the training stage (i.e., exposure bias [19]). Besides, the likelihood-based objective function used in training does not necessarily correlate well with the evaluation metrics in the test stage (i.e., loss-evaluation mismatch [19]).
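The online inference can be sketched as a loop over segments in which each step picks the most probable subset given only the subset chosen at the previous step. The sketch below is a toy version under stated assumptions: the kernel is built from hypothetical per-shot quality scores and unit feature vectors, and the argmax is found by enumerating all subsets of a segment, which is practical only for very short segments.

```python
from itertools import combinations


def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def submatrix(L, idx):
    return [[L[i][j] for j in idx] for i in idx]


def greedy_inference(L, segments):
    """Pick, segment by segment, the subset maximizing the conditional probability.

    The normalizer is shared by all candidates within a step, so comparing the
    unnormalized weights det(L restricted to previous + candidate) is enough.
    """
    prev, summary = (), []
    for segment in segments:
        best, best_weight = (), -1.0
        for k in range(len(segment) + 1):
            for y in combinations(segment, k):
                weight = det(submatrix(L, prev + y))
                if weight > best_weight:
                    best, best_weight = y, weight
        summary.append(best)
        prev = best
    return summary


# Hypothetical shots: shot 2 repeats the content of shot 0, shot 3 is new.
features = [(1.0, 0.0, 0.0), (0.8, 0.6, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
quality = [1.5, 1.2, 1.5, 1.5]
L = [[quality[i] * quality[j] * sum(a * b for a, b in zip(features[i], features[j]))
      for j in range(4)] for i in range(4)]
summary = greedy_inference(L, segments=[(0, 1), (2, 3)])
print(summary)
```

In this toy run the first step keeps shot 0, and the second step skips shot 2 (a repeat of shot 0) in favor of shot 3. If the first step had chosen differently, the second step would condition on that choice, which is exactly how an early error propagates.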

The issues above are common in seq2seq learning. It has been shown that improved results can be achieved if one tackles them explicitly [37, 38, 39, 19, 40]. Motivated by these findings, we propose a large-margin algorithm for SeqDPP to mitigate the exposure bias and loss-evaluation mismatch issues in existing SeqDPP works. Our algorithm mainly borrows some ideas from [20], which studies the large-margin principle in training recurrent neural networks. However, we are not constrained by the beam search, do not need to change the probabilistic SeqDPP model to any non-probabilistic version, and also fit a test-time evaluation metric into the large-margin formulation.

We now design a loss function

$$\mathcal{L}(\theta) = \sum_{t=1}^{T} \delta(\tilde{y}_t \cup y^*_{1:t-1},\, y^*_{1:t})\; M(y^*_t, \tilde{y}_t; L), \tag{7}$$

which includes two components: 1) a sequence-level cost $\delta$ which allows us to scale the loss function depending on how erroneous the test-time inference is compared to the oracle summary, and 2) a margin-sensitive loss term $M$ which penalizes the situation when the probability of an oracle sequence fails to exceed the probability of the model-inferred ones by a margin. Denote by $\tilde{y}_t$ and $y^*_t$ the subsets selected from the $t$-th partition $\mathcal{V}_t$ by SeqDPP and by an "oracle" user, respectively. Let $y^*_{1:t}$ represent the oracle summary until time step $t$. The sequence-level cost $\delta(\tilde{y}_t \cup y^*_{1:t-1}, y^*_{1:t})$ can be any metric (e.g., $1 - \text{F-score}$) used to contrast a system-generated summary with a user summary.

Assuming SeqDPP is able to choose the right subset $y^*_{t-1}$ from partition $\mathcal{V}_{t-1}$, given the next partition $\mathcal{V}_t$, the margin-sensitive loss penalizes the situation that the model selects a different subset $\tilde{y}_t$ from the oracle $y^*_t$,

$$M(y^*_t, \tilde{y}_t; L) = \left[ 1 - \log P(Y_t = y^*_t \mid y^*_{t-1}) + \log P(Y_t = \tilde{y}_t \mid y^*_{t-1}) \right]_+ \tag{8}$$

$$= \left[ 1 - \log \det(\Omega^t_{y^*_t}) + \log \det(\Omega^t_{\tilde{y}_t}) \right]_+, \tag{9}$$

where $[\cdot]_+ = \max(\cdot, 0)$. When we use this loss term in training SeqDPP, we always assume that the correct subset $y^*_{t-1}$ is chosen at the previous time step $t-1$. In other words, we penalize the model step by step instead of checking the whole sequence of subsets predicted by the model. This allows more effective training by 1) forcing the model to choose correct subsets at every time step, and 2) enabling us to set the gradient weights according to how erroneous a mistake at this time step actually is in the eyes of the evaluation metric.
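A per-step version of this loss can be sketched as follows. The sketch assumes that the margin term is a hinge on the gap between the log-probabilities of the oracle and predicted subsets with a unit margin, and that the sequence-level cost is one minus an F1-score computed by exact matching of shot indices. The exact-match F1 is a simplification chosen for illustration; the paper's evaluation matches shots through concept similarity instead. All function names are hypothetical.

```python
import math


def f1_score(predicted, oracle):
    """Exact-match F1 between two sets of shot indices (illustrative stand-in)."""
    predicted, oracle = set(predicted), set(oracle)
    if not predicted or not oracle:
        return 1.0 if predicted == oracle else 0.0
    hits = len(predicted & oracle)
    if hits == 0:
        return 0.0
    precision, recall = hits / len(predicted), hits / len(oracle)
    return 2 * precision * recall / (precision + recall)


def margin_term(log_p_oracle, log_p_predicted, margin=1.0):
    """Hinge penalty: positive unless the oracle beats the prediction by `margin`."""
    return max(0.0, margin - log_p_oracle + log_p_predicted)


def step_loss(oracle_prefix, oracle_subset, predicted_subset,
              log_p_oracle, log_p_predicted):
    """Sequence-level cost times margin term for a single time step."""
    system = set(oracle_prefix) | set(predicted_subset)
    target = set(oracle_prefix) | set(oracle_subset)
    cost = 1.0 - f1_score(system, target)
    return cost * margin_term(log_p_oracle, log_p_predicted)


# The model prefers shot 8 (probability 0.5) over the oracle shot 7 (probability 0.2).
wrong = step_loss({1, 4}, {7}, {8}, math.log(0.2), math.log(0.5))
# Identical prediction: the cost is zero, so the step contributes no loss.
right = step_loss({1, 4}, {7}, {7}, math.log(0.2), math.log(0.2))
print(wrong, right)
```

The second call shows the property discussed next in the text: when the prediction scores the same as the oracle under the metric, the cost vanishes and the step produces no gradient.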

Compared to MLE, it is especially appealing that the large-margin formulation flexibly takes the evaluation metric into account. As a result, it does not require SeqDPP to predict exactly the same summaries as the oracles. Instead, when the predicted and oracle summaries are equivalent (not necessarily identical) according to the evaluation metric, the model parameters are not updated.

4 Disentangling size and content in SeqDPP

In this section, we propose a sequential model of generalized DPPs (SeqGDPP) that accepts an arbitrary distribution over the sizes of the subsets whose content follows DPP distributions. It allows users to provide priors or constraints over the total number of items to be selected. We first present the generalized DPP and then describe how to use it to devise the sequential model, SeqGDPP.

4.1 Generalized DPPs (GDPPs)

Kulesza and Taskar have made an intriguing observation about the vanilla DPP: it conflates the size and content of the variable $Y$ for selecting subsets from the ground set $\mathcal{Y}$ [21]. To see this point more clearly, we can re-write a DPP as a mixture of elementary DPPs $P_E(Y; J)$ [41, Lemma 2.6],

$$P_L(Y; L) = \frac{1}{\det(L + I)} \sum_{J \subseteq \mathcal{Y}} P_E(Y; J) \prod_{n \in J} \lambda_n \;\propto\; \sum_{k=0}^{N} \sum_{J \subseteq \mathcal{Y},\, |J| = k} P_E(Y; J) \prod_{n \in J} \lambda_n, \tag{10}$$

where the first summation is over all the possible sizes of the subsets and the second is about the particular items of each subset.

Eigen-decomposing the L-ensemble kernel to $L = \sum_{n=1}^{N} \lambda_n v_n v_n^\top$, the marginal kernel of the elementary DPP $P_E(Y; J)$ is $K^J = \sum_{n \in J} v_n v_n^\top$. It is interesting to note that, due to this form of the marginal kernel, the elementary DPPs do not have their counterpart L-ensembles. The elementary DPP $P_E(Y; J)$ always chooses $|J|$ items from the ground set $\mathcal{Y}$, namely, $P(|Y| = |J|) = 1$.

Eq. (10) indicates that, to sample from the vanilla DPP, one may sample the size of a subset from a uniform distribution followed by drawing items/content for the subset. We propose to perturb this process and explicitly impose a distribution $\pi = \{\pi_k\}_{k=0}^{N}$ over the sizes of the subsets,

$$P_G(Y; L) \;\propto\; \sum_{k=0}^{N} \pi_k \sum_{J \subseteq \mathcal{Y},\, |J| = k} P_E(Y; J) \prod_{n \in J} \lambda_n. \tag{11}$$

As a result, the generalized DPP (GDPP) entails both DPP and $k$-DPP [21] as special cases (when $\pi$ is uniform and when $\pi$ is a Dirac delta distribution, respectively), offering a larger expressive spectrum. Another interesting result is for a truncated uniform distribution $\pi$ over the sizes of the subsets. In this case, we arrive at a DPP which selects subsets with bounded cardinality, $k_1 \le |Y| \le k_2$. Such a constraint arises from real applications like document summarization, image display, and sensor placement.


The normalization constant for GDPP is $Z_G = \sum_{k=0}^{N} \pi_k \sum_{J \subseteq \mathcal{Y},\, |J| = k} \prod_{n \in J} \lambda_n$. Details are included in the supplementary materials (Suppl.). The computational complexity of this normalization depends on the eigen-decomposition of $L$. With the eigenvalues $\lambda_n$, we can compute the constant $Z_G$ in polynomial time with some slight change to the recursive algorithm [41, Algorithm 7], which calculates all the elementary symmetric polynomials $\sum_{|J| = k} \prod_{n \in J} \lambda_n$ for $k = 0, \dots, N$ in $O(N^2)$ time. Therefore, the overall complexity of computing the normalization constant for GDPP is about the same as the complexity of normalizing an L-ensemble DPP (i.e., computing $\det(L + I)$).
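The quadratic-time computation can be illustrated with a short sketch. It uses the standard in-place recursion for elementary symmetric polynomials, which is a compact one-dimensional variant rather than a transcription of the cited algorithm, and then weights each polynomial by a size prior. The function names and the example eigenvalues are assumptions.

```python
def elementary_symmetric_polynomials(eigenvalues):
    """Return [e_0, e_1, ..., e_N] of the eigenvalues in O(N^2) time."""
    e = [1.0] + [0.0] * len(eigenvalues)
    for n, lam in enumerate(eigenvalues, start=1):
        # Update from high to low order so each eigenvalue is used at most once.
        for k in range(n, 0, -1):
            e[k] += lam * e[k - 1]
    return e


def gdpp_normalizer(eigenvalues, size_weights):
    """Sum over sizes k of size_weights[k] * e_k(eigenvalues)."""
    e = elementary_symmetric_polynomials(eigenvalues)
    return sum(w * e_k for w, e_k in zip(size_weights, e))


eigenvalues = [1.0, 2.0, 3.0]
e = elementary_symmetric_polynomials(eigenvalues)
z_all_sizes = gdpp_normalizer(eigenvalues, [1.0, 1.0, 1.0, 1.0])  # every size weighted by 1
z_size_two = gdpp_normalizer(eigenvalues, [0.0, 0.0, 1.0, 0.0])   # only subsets of size 2
print(e, z_all_sizes, z_size_two)
```

With every size weighted by one, the result equals the product of `(1 + eigenvalue)` terms, i.e. the usual L-ensemble normalizer; with all weight on a single size, it reduces to the normalizer of the corresponding fixed-size DPP.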


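With the normalization constant $Z_G$, we are ready to write out the probability of selecting a particular subset $y \subseteq \mathcal{Y}$ from the ground set by GDPP,

$$P_G(Y = y; L) = \frac{\pi_{|y|}}{Z_G} \det(L_y), \tag{12}$$

in which the concise form is due to the property of the elementary DPPs that $P_E(Y = y; J) = 0$ when $|y| \ne |J|$.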

GDPP as a mixture of $k$-DPPs.

The GDPP expressed above has a close connection to the $k$-DPPs [21]. This is not surprising due to the definition of GDPP (cf. Eq. (11)). Indeed, GDPP can be exactly interpreted as a mixture of $N + 1$ $k$-DPPs $P_k(Y = y; L)$, $k = 0, 1, \dots, N$,

$$P_G(Y = y; L) = \frac{\pi_{|y|} Z_{|y|}}{Z_G}\, P_{|y|}(Y = y; L), \qquad Z_k = \sum_{J \subseteq \mathcal{Y},\, |J| = k} \prod_{n \in J} \lambda_n,$$

if all the $k$-DPPs, i.e., the mixture components, share the same L-ensemble kernel $L$ as GDPP.

If we introduce a new notation for the mixture weights, $p_k \triangleq \pi_k Z_k / Z_G$, the GDPP can then be written as

$$P_G(Y = y; L) = p_{|y|}\, P_{|y|}(Y = y; L). \tag{13}$$

Moreover, there is no necessity to adhere to the involved expression of $p_k$. Under some scenarios, directly playing with $p_k$ may significantly ease the learning process. We will build a sequential model upon the GDPP of form (13) in the next section.

Exact sampling.

Following the interpretation of GDPP as a weighted combination of $k$-DPPs, we have the following decomposition of the probability:

$$P(Y \mid \text{GDPP}) = P(k \mid \text{GDPP})\, P(Y \mid k\text{-DPP}), \tag{14}$$

where, with a slight abuse of notation, we let $P(k \mid \text{GDPP}) = p_k$ denote the probability of sampling a $k$-DPP from GDPP. Therefore, we can employ a two-phase sampling procedure from the GDPP,

  • Sample $k$ from the discrete distribution $p = \{p_k\}_{k=0}^{N}$.

  • Sample $Y$ from $k$-DPP.
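The two phases above can be sketched for a tiny ground set. Note the swapped-in technique: instead of the eigenvector-based exact sampler for fixed-size DPPs, the second phase here simply enumerates all subsets of the drawn size and samples one in proportion to the determinant of its sub-kernel. This gives the same distribution but only scales to a handful of items. Kernel values, mixture weights, and names are illustrative assumptions.

```python
import random
from itertools import combinations


def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d


def submatrix(L, idx):
    return [[L[i][j] for j in idx] for i in idx]


def sample_gdpp(L, size_weights, rng):
    """Phase 1: draw a size k. Phase 2: draw a size-k subset with weight det(L_y)."""
    n = len(L)
    k = rng.choices(range(n + 1), weights=size_weights)[0]
    subsets = list(combinations(range(n), k))
    # Clamp tiny negative values caused by floating-point round-off.
    weights = [max(0.0, det(submatrix(L, y))) for y in subsets]
    return rng.choices(subsets, weights=weights)[0]


L = [[2.0, 1.8, 0.0],
     [1.8, 2.0, 0.0],
     [0.0, 0.0, 2.0]]
rng = random.Random(0)
# All mixture weight on size 2, so the sampler behaves like a 2-DPP.
draws = [sample_gdpp(L, [0.0, 0.0, 1.0, 0.0], rng) for _ in range(200)]
print(draws[:5])
```

Sizes whose subsets all have zero determinant (for example, sizes above the rank of the kernel) should be given zero mixture weight, since the second phase has nothing to sample from in that case.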

4.2 A sequential model of GDPPs (SeqGDPP)

In this section, we construct a sequential model of the generalized DPPs (SeqGDPP) such that it not only models the temporal and diverse properties as SeqDPP does, but also allows users to specify the prior or constraint over the length of the video summary.

We partition a long video sequence $\mathcal{V}$ into $T$ disjoint yet consecutive short segments $\bigcup_{t=1}^{T} \mathcal{V}_t = \mathcal{V}$. The main idea of SeqGDPP is to adaptively distribute the expected length $M_0$ of the video summary to different video segments, over each of which a GDPP is defined. In particular, we replace the conditional DPPs in SeqDPP (cf. eq. (4)) by GDPPs,

$$P(Y_t = y_t \mid Y_{t-1} = y_{t-1}) \triangleq P_G(Y_t = y_t; \Omega^t) = p^t_{|y_t|}\, P_{|y_t|}(Y_t = y_t; \Omega^t), \tag{15}$$

where the last equality follows Eq. (13), and recall that the L-ensemble kernel $\Omega^t$ encodes the dependencies on the video frames/shots selected from the immediate past segment $y_{t-1}$ (cf. Section 2, Eq. (5)). The discrete distribution $p^t = \{p^t_k\}$ is over all the possible sizes $\{k\}$ of the subsets at time step $t$.

We update $p^t$ adaptively according to

$$p^t_k \;\propto\; \exp\left( -\alpha (k - \mu^t)^2 \right), \tag{16}$$

where the mean $\mu^t$ is our belief about how many items should be selected from the current video segment $\mathcal{V}_t$ and the concentration factor $\alpha > 0$ tunes the confidence of the belief. When $\alpha$ approaches infinity, the GDPP $P_G(Y_t; \Omega^t)$ degenerates to $k$-DPP and chooses exactly $\mu^t$ items into the video summary.

Our intuition for parameterizing the mean $\mu^t$ encompasses three pieces of information: the expected length $M_0$ over the overall video summary, the number of items that have been selected into the summary up to the $t$-th time step, and the variety of the visual content in the current video segment $\mathcal{V}_t$. Specifically,

$$\mu^t \triangleq \frac{M_0 - \sum_{t'=1}^{t-1} |y_{t'}|}{T - t + 1} + w^\top \phi(\mathcal{V}_t), \tag{17}$$

where the first term is the average number of items to be selected from each of the remaining video segments to make up an overall summary of length $M_0$, the second term $w^\top \phi(\mathcal{V}_t)$ moves around the average number depending on the current video segment $\mathcal{V}_t$, and $\phi(\cdot)$ extracts a feature vector from the segment. We learn $w$ from the training data, i.e., user-annotated video summaries and their underlying videos. We expect that a visually homogeneous video segment gives rise to negative $w^\top \phi(\mathcal{V}_t)$ such that fewer than the average number of items will be selected from it, and vice versa.
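The length-allocation mechanism can be sketched numerically. The sketch assumes that the size prior is proportional to `exp(-alpha * (k - mu)**2)` and that the mean is the remaining budget divided by the number of remaining segments plus a learned linear correction; both forms are this example's reading of the description above, and all argument names are hypothetical.

```python
import math


def expected_segment_size(target_length, selected_so_far, num_segments, t, w, features):
    """Mean number of shots to pick at step t (1-indexed)."""
    remaining_segments = num_segments - t + 1
    average = (target_length - selected_so_far) / remaining_segments
    correction = sum(w_i * f_i for w_i, f_i in zip(w, features))
    return average + correction


def size_prior(mu, alpha, max_size):
    """Normalized prior over subset sizes 0..max_size, peaked around mu."""
    scores = [math.exp(-alpha * (k - mu) ** 2) for k in range(max_size + 1)]
    total = sum(scores)
    return [s / total for s in scores]


# A user asks for 10 shots in total; 4 were picked in the first two of five segments.
mu = expected_segment_size(target_length=10, selected_so_far=4, num_segments=5, t=3,
                           w=[0.0, 0.0], features=[0.3, 0.7])
soft = size_prior(mu, alpha=0.5, max_size=4)    # tolerant of other sizes
sharp = size_prior(mu, alpha=50.0, max_size=4)  # nearly a fixed-size constraint
print(mu, [round(p, 3) for p in soft], [round(p, 3) for p in sharp])
```

With a zero correction the mean is 2 shots for the third segment. A small `alpha` still leaves noticeable mass on sizes 1 and 3, while a large `alpha` concentrates almost all mass on size 2, which mirrors the degenerate fixed-size case.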

4.3 Learning and inference

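For the purpose of out-of-sample extension, we shall parameterize SeqGDPP in such a way that, at time step $t$, it conditions on the corresponding video segment $\mathcal{V}_t$ and the selected shots $y_{t-1}$ from the immediate previous time step. We use a simple convex combination [21] of base GDPPs whose kernels are predefined over the video for the parameterization. Concretely, at each time step $t$,

$$P(Y_t = y_t \mid Y_{t-1} = y_{t-1}) \triangleq \sum_{i} \beta_i\, P_G(Y_t = y_t; \Omega^{t(i)}), \tag{18}$$

where the L-ensemble kernels $\Omega^{t(i)}$, $i = 1, 2, \dots$, of the base GDPPs are derived from the corresponding kernels $L^{t(i)}$ of the conditional DPPs (eq. (5)). We compute different Gaussian RBF kernels for $L^{t(i)}$ from the segment $\mathcal{V}_t$ and previously selected subset $y_{t-1}$ by varying the bandwidths. The combination coefficients ($\beta_i \ge 0$, $\sum_i \beta_i = 1$) are learned from the training videos and summaries.

Consider a single training video $\mathcal{V} = \bigcup_{t=1}^{T} \mathcal{V}_t$ and its user summary $\{y_t \subseteq \mathcal{V}_t\}_{t=1}^{T}$ for the convenience of presentation. We learn SeqGDPP by maximizing the log-likelihood,

$$\mathcal{L} = \sum_{t=1}^{T} \log P(Y_t = y_t \mid Y_{t-1} = y_{t-1}) = \sum_{t=1}^{T} \log \sum_{i} \beta_i\, P_G(Y_t = y_t; \Omega^{t(i)}). \tag{19}$$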

5 Experimental Setup and Results

In this section, we provide details on compiling an egocentric video summarization dataset, annotation process, and the employed evaluation procedure.


While various video summarization datasets exist [42, 28, 43], we prioritize consumer-grade egocentric videos. Due to their lengthy nature, they carry a high level of redundancy, making summarization a vital and challenging problem. The UT Egocentric (UTE) dataset [22] includes 4 videos, each between 3 and 5 hours long, covering activities such as driving, shopping, studying, etc. in an uncontrolled environment. However, we find this dataset insufficient for supervised video summarization; hence, we significantly extend it by adding another 8 egocentric videos (averaging over 6 hours each) from the social interactions dataset [23]. These videos are recorded using head-mounted cameras worn by individuals during their visits to Disney parks. Our efforts result in a dataset consisting of 12 long videos with a total of over 60 hours of video content.

User Summary Collection.

Having compiled a set of 12 egocentric videos, we recruit three students to summarize the videos. The only instruction we give them is to operate on the 5-second video shot level. Namely, the full shot will be selected into the summary once any frame in the shot is chosen. Without any further constraints, the participants can thus use their own granularities and preferences to summarize the videos. Table 1 shows that users have their own distinct preferences about the summary lengths.

        User 1            User 2            User 3            Oracle
Min     79                74                45                74
Max     174               222               352               200
Avg.    105.75 ± 27.21    133.33 ± 54.04    177.92 ± 90.96    135.92 ± 45.99
Table 1: Some statistics about the lengths of the summaries generated by three annotators.

Oracle Summaries.

Supervised video summarization approaches are conventionally trained on one target summary per video. Having obtained 3 user summaries per video, we aggregate them into one oracle summary using a greedy algorithm that has been used in several previous works [8, 9, 10], and train the model on them. We leave the details of the algorithm to the supplementary materials.


We follow Zhang et al. [11] in extracting the features using pre-trained GoogleNet [44], after the pool5 layer, which results in a 1024-d feature representation for each shot in the video.

(a) $\Pi$ temporal filter
(b) Gaussian temporal filter
Figure 1: Comparison results for the generic video summarization task. The x-axis represents the temporal filter parameter. In the case of the $\Pi$ filter, it indicates how far a match can be temporally (in seconds), whereas in the Gaussian filter, it is the kernel bandwidth.


There has been a plethora of different approaches for evaluating the quality of video summaries including user studies [45, 46], using low-level or pixel-level measurements to compare system summaries versus human summaries [8, 24, 25, 47, 4], and temporal-overlap-based metrics defined for two summaries [42, 5, 7, 11]. We share the same opinion as [48, 9, 10] in evaluating the summaries using high-level semantic information.

For measuring the quality of system summaries, Sharghi et al. [10] proposed to obtain dense shot-level concept annotations and convert them to semantic vectors where 1's and 0's indicate the presence or absence of visual concepts such as Sky, Car, and Tree for that specific shot. It is straightforward to measure the similarity between two shots using the intersection-over-union (IoU) of their corresponding tags. For instance, if one shot is tagged by {Street,Tree,Sun} and the other by {Lady,Car,Street,Tree}, then the IoU is $2/5 = 0.4$. Having defined the similarity measure between shots, one can conveniently perform maximum weight matching on the bipartite graph, where the user and system summaries are placed on opposing sides of the graph.
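The tag-based similarity and the matching step can be sketched as follows. The IoU function reproduces the example in the text. The matching function is a brute-force stand-in that tries every assignment, which is only workable for a few shots; a real evaluation would use a proper maximum-weight bipartite matching solver. Function names and the toy summaries are assumptions.

```python
from itertools import permutations


def concept_iou(tags_a, tags_b):
    """Intersection-over-union of two sets of concept tags."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_matching_score(system_summary, user_summary):
    """Maximum total IoU over one-to-one matchings (brute force, tiny inputs only)."""
    small, large = sorted((system_summary, user_summary), key=len)
    best = 0.0
    for assignment in permutations(range(len(large)), len(small)):
        score = sum(concept_iou(small[i], large[j]) for i, j in enumerate(assignment))
        best = max(best, score)
    return best


iou = concept_iou({"Street", "Tree", "Sun"}, {"Lady", "Car", "Street", "Tree"})

system = [{"Street", "Tree", "Sun"}, {"Car"}]
user = [{"Car"}, {"Lady", "Car", "Street", "Tree"}]
score = best_matching_score(system, user)
print(iou, score)
```

In the toy example the best matching pairs the two `Car`-only shots (IoU 1.0) and the two street scenes (IoU 0.4). Precision and recall can then be derived from the matched weight and the lengths of the two summaries.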

To collect shot-level concept annotations, we start with the dictionary of [10] and remove the concepts that do not appear often enough, such as Boat and Ocean. Furthermore, we apply SentiBank detectors [49] (with over 1400 pre-trained classifiers) on the frames of the videos to make a list of visual concepts appearing commonly throughout the dataset. Next, by watching the videos, we select from this list the top candidates and append them to the final dictionary, which includes 54 concepts. These steps are necessary as our dataset contains over 3 times the video content in [10]. Figure 2 illustrates the appearance counts of the visual concepts throughout our dataset.

Having constructed a dictionary of concepts, we uniformly sample 5 frames from each shot and ask Amazon Mechanical Turk workers to tag them with the concepts. The instruction here is that a concept must be selected if it appears in any of the 5 frames. We hire 3 Turkers per shot and pool their annotations by taking the union. On average, each shot is tagged with 11 concepts. This is significantly larger than the average of 4 tags/shot in Sharghi et al. [10], resulting in more reliable assessment upon evaluation.

While the metric introduced in [10] compares summaries using high-level concept similarities, it allows a shot in the system summary to be matched with any shot in the user summary without any temporal restrictions. This causes at least two problems. First, for an important shot in the gold summary, there is a chance we match it to a visually similar shot that may have happened long before or after. Second, since the shot similarities are necessarily non-negative, matching weakly similar shots that are temporally far apart falsely increases the matching score. To fix these two issues, we modify this metric by applying a temporal filter on the measured similarities. We use two types of filters: 1) the $\Pi$ (a.k.a. rectangular) function and 2) the Gaussian function. The $\Pi$ filter sets the similarities outside of a time range to zero, hence forcing the metric to match a shot to its temporally close candidates. The Gaussian filter, on the other hand, applies a decaying factor to farther matches.
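The two temporal filters can be sketched as simple reweightings of a shot-to-shot similarity. The exact functional forms are assumptions of this example: the rectangular filter keeps a similarity only when the time gap is within a window (boundary included), and the Gaussian filter multiplies the similarity by `exp(-gap**2 / (2 * bandwidth**2))`, with gaps measured in seconds.

```python
import math


def rectangular_filter(similarity, time_gap, window):
    """Keep the similarity only if the two shots are at most `window` seconds apart."""
    return similarity if abs(time_gap) <= window else 0.0


def gaussian_filter(similarity, time_gap, bandwidth):
    """Decay the similarity smoothly as the two shots move apart in time."""
    return similarity * math.exp(-(time_gap ** 2) / (2.0 * bandwidth ** 2))


near = rectangular_filter(0.8, time_gap=30, window=60)
far = rectangular_filter(0.8, time_gap=90, window=60)
decayed_near = gaussian_filter(0.8, time_gap=30, bandwidth=60)
decayed_far = gaussian_filter(0.8, time_gap=90, bandwidth=60)
# With an infinite parameter, both filters leave the similarity untouched.
unfiltered = (rectangular_filter(0.8, 1e9, math.inf), gaussian_filter(0.8, 1e9, math.inf))
print(near, far, decayed_near, decayed_far, unfiltered)
```

The filtered similarities would then replace the raw IoU values as edge weights in the bipartite matching.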

To evaluate a summary, we compare it to all 3 user-annotated summaries and average the scores. We report the performance by varying the corresponding filter's parameter, i.e., the temporal window size for the $\Pi$ filter and the bandwidth for the Gaussian filter, as illustrated in Figure 1. In addition, we compute the Area-Under-the-Curve (AUC) of the average F1-scores in Table 2. It is worth mentioning that setting the parameters of the filters to infinity results in the same metric defined by Sharghi et al. [10]. Our metric is thus a generalization of the latter.

Data split.

In order to have a comprehensive assessment of the models, we employ a leave-one-out strategy. Therefore, we run 12 sets of experiments, each time leaving one video out for testing, two for validation (to tune hyper-parameters), and the remaining 9 for training the models. We report the average performance on all 12 videos later in this section.

Large-Margin Training/Inference.

Similar to practices in seq2seq learning [20, 19], we accelerate training by pre-training the standard sequential models, i.e., maximizing the likelihood of user summaries using SGD. This serves as a good initialization, resulting in a faster training process. At the test time, we follow Eq. (6) to generate the system summary.

SeqGDPP Details.

Given the features that are extracted using GoogleNet, we compute Gaussian RBF kernels $\{L^{t(i)}\}$ over the video shots by varying the bandwidths $\sigma_i$, which are defined in terms of $\sigma_0$, where $\sigma_0$ is the median of all pairwise distances between the video shots. Note that the base kernels $\{\Omega^{t(i)}\}$ for GDPPs are then computed through eq. (5) such that they take account of the dependency between two adjacent time steps.

We also need to extract the feature vector $\phi(\mathcal{V}_t)$ to capture the information in each video segment $\mathcal{V}_t$. In eq. (17), we use such a feature vector to help fine-tune the mean of the distribution $p^t$ over the possible subset sizes. Intuitively, larger subsets should be selected from segments with more frequent visual appearance changes. As such, we compute the standard deviation per feature dimension within the segment $\mathcal{V}_t$ for $\phi(\mathcal{V}_t)$.

There are three sets of parameters in SeqGDPP: $\alpha$ and $w$ in the distribution over the subset size, and $\{\beta_i\}$ for the convex combination of some base GDPPs. We maximize the log-likelihood simply using gradient descent to solve for $w$ and $\{\beta_i\}$, and cross-validate $\alpha$.

Query-Focused Video Summarization.

As argued by Sharghi et al. [9], due to the subjectivity of video summarization, it is appealing to personalize the summary based on the user's preferences. Hence, in query-focused summarization, the decision of whether to include a video shot in the summary depends jointly on the shot's relevance to a query term (that comes from the user) and its importance in the context of the video. In [10], the authors made available a collection of 184 {video,query} pairs. To further assess our models, we compare them to the state-of-the-art query-focused video summarization frameworks in the supplementary material.

Method              AUC ($\Pi$ filter)    AUC (Gaussian filter)
Uniform             12.33                 12.36
SubMod [5]          11.20                 11.12
SuperFrames [42]    11.46                 11.28
LSTM-DPP [11]       7.38                  7.36
SeqDPP [8]          9.71                  9.56
LM-SeqDPP           15.05                 14.69
SeqGDPP             15.29                 14.86
LM-SeqGDPP          15.87                 15.43
Table 2: Comparison results for supervised video summarization (%). The AUCs are computed from the F1-score curves drawn in Figure 1 up to the 60-second mark. The blue and red colors group the base model and its large-margin version.

5.1 Quantitative Results and Analyses

In this section, we report quantitative results comparing our proposed models against various baselines:

Uniform. As the name suggests, we sample shots with fixed step size from the video such that the generated summary has equal length (same number of shots) as the oracle summary.

SubMod. Gygli et al. [5] learn a convex combination of interestingness, representativeness, and uniformity from user summaries in a supervised manner. At the test time, the model generates the summary given the expected summary length, which is the length of the oracle summary.

SuperFrames. In [42], Gygli et al. first segment the video into superframes and then measure their individual importance scores. Given the scores, the subsets that achieve the highest cumulative scores are considered the desired summary. Since a shot is 5 seconds long in our dataset, we skip the superframe segmentation component. We train a neural network consisting of three fully-connected layers to measure each shot’s importance score, and then choose the subsets with the highest cumulative scores as the summary.
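Because the shot scores are additive, the subset of a given size with the highest cumulative score is simply the set of top-scoring shots. The sketch below illustrates only this selection step; the function name `top_k_summary` and the scores are hypothetical, and the three-layer scoring network is not reproduced.

```python
def top_k_summary(scores, k):
    """Return, in temporal order, the indices of the k highest-scoring shots."""
    if not 0 <= k <= len(scores):
        raise ValueError("k must be in [0, len(scores)]")
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Re-sort the chosen indices so the summary follows the video timeline.
    return sorted(ranked[:k])


# Hypothetical importance scores for five shots.
shot_scores = [0.1, 0.9, 0.3, 0.8, 0.2]
summary = top_k_summary(shot_scores, k=2)
```

In this toy case `summary` is `[1, 3]`, the two shots with scores 0.9 and 0.8.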

LSTM-DPP. In [11], Zhang et al. exploit LSTMs to model the temporal dependency between the shots of the video, and further use DPPs to enforce diversity in selecting important shots. Like the previous baselines, this model has access to the expected summary length at test time.
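To illustrate how a DPP favors diverse subsets, the sketch below evaluates the standard L-ensemble probability P(Y) = det(L_Y) / det(L + I) [13] on a toy kernel. The kernel values and the helper names `det` and `dpp_probability` are assumptions of this example; it is not the LSTM-DPP model, which additionally learns the kernel from LSTM outputs.

```python
from itertools import combinations


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0  # the determinant of an empty matrix is 1
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result  # a row swap flips the sign
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def dpp_probability(L, subset):
    """P(Y = subset) = det(L_subset) / det(L + I) for an L-ensemble DPP."""
    n = len(L)
    L_sub = [[L[i][j] for j in subset] for i in subset]
    L_plus_I = [[L[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
                for i in range(n)]
    return det(L_sub) / det(L_plus_I)


# Toy kernel: shots 0 and 1 are near-duplicates, shot 2 is unrelated to both.
L = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

p_redundant = dpp_probability(L, (0, 1))
p_diverse = dpp_probability(L, (0, 2))
# Sanity check: the probabilities of all subsets sum to one.
total = sum(dpp_probability(L, s)
            for k in range(len(L) + 1)
            for s in combinations(range(len(L)), k))
```

The pair of near-duplicate shots receives a much lower probability than the pair of dissimilar shots, which is the diversity effect exploited when selecting shots.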

SeqDPP. This is the original framework of Gong et al. [8]. Unlike other baselines, this model determines the summary length automatically.

Several interesting and promising observations can be made from Table 2 and Figure 1:

1) Comparing SeqDPP and large-margin SeqDPP (denoted LM-SeqDPP), we observe a significant performance boost. As illustrated in Figure 1, the performance gap is consistently large across different filter parameters. Although both SeqDPP and LM-SeqDPP determine the summary length automatically, we speculate that the latter produces summaries that more closely resemble the oracle summaries in terms of both length and conveyed semantic information.

Figure 2: Count of concept appearances in the collected annotations accumulated over all 12 videos.

2) Comparing SeqGDPP to SeqDPP, for which users cannot tune the expected length of the summary, we can see that SeqGDPP significantly outperforms SeqDPP. This is not surprising since SeqDPP does not have a mechanism to take the user-supplied summary length into account. As a result, the number of shots selected by SeqDPP is sometimes much smaller or larger than the length of the user summary.

3) Large-margin SeqGDPP (LM-SeqGDPP) performs slightly better than SeqGDPP, and it outperforms all other models. As both models generate system summaries of equal length to the oracle, the large-margin formulation helps produce better summaries by optimizing for the evaluation metric.

4) As described earlier, our refined evaluation scheme is a generalization of the BM; by setting the filter parameters to infinity (hence no temporal restriction enforced by the filters), we can obtain the performance under the BM metric, represented by the last points of the curves in Figure 1. While performance under our refined metric differs significantly from model to model, under the BM metric the models perform almost the same. This is due to the problems we mentioned earlier in Section 5, where we discussed the evaluation metric.

6 Conclusion

In this work, we made a twofold contribution towards improving sequential determinantal point process-based models for supervised video summarization. We proposed a large-margin training scheme that facilitates learning models more effectively by addressing two common problems in most seq2seq frameworks: exposure bias and loss-evaluation mismatch. In modeling terms, we introduced a new probabilistic block, GDPP, which, when integrated into SeqDPP, allows the resulting model to accept priors about the expected summary length. Furthermore, we compiled a large video summarization dataset consisting of 12 egocentric videos totaling over 60 hours of content. Additionally, we collected 3 user-annotated summaries per video as well as dense concept annotations required for evaluation. Finally, we conducted experiments on the dataset to verify the effectiveness of the proposed models.

References


  • [1] OBILE, W.: Ericsson mobility report (2016)
  • [2] Hirsch, R.: Seizing the Light: A Social & Aesthetic History of Photography. Taylor & Francis (2017)
  • [3] Wolf, W.: Key frame selection by motion analysis. In: Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on. Volume 2., IEEE (1996) 1228–1231
  • [4] Zhao, B., Xing, E.P.: Quasi real-time summarization for consumer videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 2513–2520

  • [5] Gygli, M., Grabner, H., Van Gool, L.: Video summarization by learning submodular mixtures of objectives. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3090–3098
  • [6] Sun, M., Farhadi, A., Seitz, S.: Ranking domain-specific highlights by analyzing edited videos. In: European conference on computer vision, Springer (2014) 787–802
  • [7] Potapov, D., Douze, M., Harchaoui, Z., Schmid, C.: Category-specific video summarization. In: European conference on computer vision, Springer (2014) 540–555
  • [8] Gong, B., Chao, W.L., Grauman, K., Sha, F.: Diverse sequential subset selection for supervised video summarization. In: Advances in Neural Information Processing Systems. (2014) 2069–2077
  • [9] Sharghi, A., Gong, B., Shah, M.: Query-focused extractive video summarization. In: European Conference on Computer Vision, Springer (2016) 3–19
  • [10] Sharghi, A., Laurel, J.S., Gong, B.: Query-focused video summarization: Dataset, evaluation, and a memory network based approach. arXiv preprint arXiv:1707.04960 (2017)
  • [11] Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: European Conference on Computer Vision, Springer (2016) 766–782
  • [12] Sadeghian, A., Sundaram, L., Wang, D.Z., Hamilton, W.F., Branting, K., Pfeifer, C.: Automatic semantic edge labeling over legal citation graphs. Artificial Intelligence and Law 26(2) (2018) 127–144
  • [13] Kulesza, A., Taskar, B., et al.: Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning 5(2–3) (2012) 123–286
  • [14] Sutskever, I., Martens, J., Hinton, G.E.: Generating text with recurrent neural networks. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). (2011) 1017–1024
  • [15] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. (2014) 3104–3112
  • [16] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
  • [17] Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I., Hinton, G.: Grammar as a foreign language. In: Advances in Neural Information Processing Systems. (2015) 2773–2781
  • [18] Serban, I.V., Sordoni, A., Bengio, Y., Courville, A.C., Pineau, J.: Building end-to-end dialogue systems using generative hierarchical neural network models. In: AAAI. (2016) 3776–3784
  • [19] Ranzato, M., Chopra, S., Auli, M., Zaremba, W.: Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015)
  • [20] Wiseman, S., Rush, A.M.: Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960 (2016)
  • [21] Kulesza, A., Taskar, B.: k-dpps: Fixed-size determinantal point processes. In: Proceedings of the 28th International Conference on Machine Learning (ICML). (2011) 1193–1200
  • [22] Lee, Y.J., Ghosh, J., Grauman, K.: Discovering important people and objects for egocentric video summarization. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 1346–1353
  • [23] Fathi, A., Hodgins, J.K., Rehg, J.M.: Social interactions: A first-person perspective. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 1226–1233
  • [24] Khosla, A., Hamid, R., Lin, C.J., Sundaresan, N.: Large-scale video summarization using web-image priors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2013) 2698–2705
  • [25] Kim, G., Sigal, L., Xing, E.P.: Joint summarization of large-scale collections of web images and videos for storyline reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 4225–4232
  • [26] Xiong, B., Grauman, K.: Detecting snap points in egocentric video with a web photo prior. In: European Conference on Computer Vision, Springer (2014) 282–298
  • [27] Chu, W.S., Song, Y., Jaimes, A.: Video co-summarization: Video summarization by visual co-occurrence. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3584–3592
  • [28] Song, Y., Vallmitjana, J., Stent, A., Jaimes, A.: Tvsum: Summarizing web videos using titles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 5179–5187
  • [29] Liu, W., Mei, T., Zhang, Y., Che, C., Luo, J.: Multi-task deep visual-semantic embedding for video thumbnail selection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3707–3715
  • [30] Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence-video to text. In: Proceedings of the IEEE international conference on computer vision. (2015) 4534–4542
  • [31] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning. (2015) 2048–2057
  • [32] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out: Proceedings of the ACL-04 workshop. Volume 8., Barcelona, Spain (2004)
  • [33] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics (2002) 311–318
  • [34] Hough, J.B., Krishnapur, M., Peres, Y., Virág, B., et al.: Determinantal processes and independence. Probability surveys 3 (2006) 206–229
  • [35] Borodin, A., Rains, E.M.: Eynard–mehta theorem, schur process, and their pfaffian analogs. Journal of statistical physics 121(3) (2005) 291–317
  • [36] Gong, B., Chao, W., Grauman, K., Sha, F.: Diverse sequential subset selection for supervised video summarization. In: Advances in Neural Information Processing Systems (NIPS). (2014) 2069–2077
  • [37] Daumé, H., Langford, J., Marcu, D.: Search-based structured prediction. Machine learning 75(3) (2009) 297–325
  • [38] Ross, S., Gordon, G.J., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: International Conference on Artificial Intelligence and Statistics. (2011) 627–635
  • [39] Collins, M., Roark, B.: Incremental parsing with the perceptron algorithm. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics (2004) 111
  • [40] Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., Liu, Y.: Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433 (2015)
  • [41] Kulesza, A., Taskar, B.: Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning 5(2–3) (2012) 123–286
  • [42] Gygli, M., Grabner, H., Riemenschneider, H., Van Gool, L.: Creating summaries from user videos. In: European conference on computer vision, Springer (2014) 505–520
  • [43] De Avila, S.E.F., Lopes, A.P.B., da Luz Jr, A., de Albuquerque Araújo, A.: Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters 32(1) (2011) 56–68
  • [44] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 1–9
  • [45] Lee, Y.J., Grauman, K.: Predicting important objects for egocentric video summarization. International Journal of Computer Vision 114(1) (2015) 38–55
  • [46] Lu, Z., Grauman, K.: Story-driven summarization for egocentric video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2013) 2714–2721
  • [47] Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Summary transfer: Exemplar-based subset selection for video summarization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 1059–1067
  • [48] Yeung, S., Fathi, A., Fei-Fei, L.: Videoset: Video summary evaluation through text. arXiv preprint arXiv:1406.5824 (2014)
  • [49] Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.F.: Large-scale visual sentiment ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM international conference on Multimedia, ACM (2013) 223–232