1 Introduction
It is now much easier than ever before to produce videos due to ubiquitous acquisition capabilities. The videos captured by UAVs and drones, from ground surveillance, and by body-worn cameras easily reach the scale of gigabytes per day. In 2017, it was estimated that there were at least 2.32 billion active camera phones in the world [1]. In 2015, 2.4 million GoPro body cameras were sold worldwide [2]. While big video data is a great source for information discovery and extraction, the computational challenges are unparalleled. Automatically summarizing videos has become a substantial need for browsing, searching, and indexing visual content.

Under the extractive video summarization framework, a summary is composed of important shots of the underlying video. This notion of importance, however, varies drastically across the literature. Wolf defines importance as a function of motion cues [3]. Zhao and Xing formulate it by reconstruction errors [4]. Gygli et al. learn a mixture of interestingness, representativeness, and uniformity measures to find what is important [5]. These differences highlight the complexity of video summarization. The criteria for summarizing depend heavily on the content, style, and length of the video and, perhaps more importantly, on users' preferences. For instance, in a surveillance video, a running action might flag an important event, whereas in a football match it can be a normal action observed throughout the video.
To overcome those challenges, there are two broad categories of approaches. One is to reduce the problem domain to a homogeneous set of videos which share about the same characteristics (e.g., length and style) so that experts can engineer some domain-specific criteria of good summaries [6, 7]. The other is to design models that can learn the criteria automatically, often from human-annotated summaries in a supervised manner [8, 9, 10, 11, 12]. The latter is more appealing because a model can be trained for different settings of choice, while the former is not as scalable.
This paper is also in the vein of supervised video summarization based on the determinantal point process (DPP) [13]. Arising from quantum physics and random matrix theories, DPP is a powerful tool to balance importance and diversity, two axiomatic properties in extractive video summarization. Indeed, a good summary must be collectively diverse in the sense that it should not contain redundant information. Moreover, a shot selected into the summary must add value to the quality of the summary; otherwise, it is not important in the context of the summary. Thanks to the versatility of DPP and one of its extensions, SeqDPP [8], for handling sequences, they have been employed in a rich line of recent works on video summarization [9, 10].

This paper makes a two-pronged contribution towards improving these models to more effectively learn better video summarizers. In terms of learning, we propose a large-margin algorithm to address SeqDPP's exposure bias problem, explained below. In terms of modeling, we design a new probabilistic block such that, when it is integrated into SeqDPP, the resulting model accepts user input about the expected length of the summary.
We first explain the exposure bias problem with the existing SeqDPP works; it is actually a mismatch issue in many sequence-to-sequence (seq2seq) learning methods [14, 15, 16, 17, 18]. When the model is trained by maximizing the likelihood of user annotations, it takes as input user-annotated "oracle" summaries. At test time, however, the model generates output by searching over the output space in a greedy fashion, and its intermediate conditional distributions may receive input from the previous time step that deviates from the oracle. In other words, the model is exposed to different environments in the training and testing stages, respectively. This exposure bias also results in the loss-evaluation mismatch [19] between the training phase and the inference. To tackle these issues, we adapt the large-margin algorithm originally derived for training LSTMs [20] to SeqDPPs. The main idea is to alleviate the exposure bias by incorporating test-time inference techniques into the objective function used for training. Meanwhile, we add to the large-margin formulation a multiplicative reward term that is directly related to the evaluation metrics, to mitigate the loss-evaluation mismatch.
In addition to the new large-margin learning algorithm, we also improve the SeqDPP model with a novel probabilistic distribution in order to allow users to control the lengths of system-generated video summaries. To this end, we propose a generalized DPP (GDPP) in which an arbitrary prior distribution can be imposed over the sizes of the subsets of video shots. As a result, both the vanilla DPP and the kDPP [21] can be considered special instances of GDPP. Moreover, we can conveniently substitute the (conditional) DPPs in SeqDPP with GDPPs. When a user gives an expected length of the summary, we dynamically allocate it to different segments of the video and then choose the right numbers of video shots from the corresponding segments.
We conduct extensive experiments to verify the improved techniques for supervised video summarization. First of all, we significantly extend the UTE dataset [22] and its annotations of video summaries and per-shot concepts [10] with another eight egocentric videos [23]. Following the protocol described in [10], we collect three user summaries for each of the hours-long videos, as well as concept annotations for each video shot. We evaluate the large-margin learning algorithm on not only the proposed sequential GDPP but also the existing SeqDPP models.
2 Related work and background
We briefly review the related work in this section. Besides, we also describe the main body of DPPs and SeqDPPs. Readers are referred to [13] and [8] for more details and properties of the two versatile probabilistic models.
Supervised video summarization.
In recent years, data-driven learning has attracted plenty of attention for tackling research problems. This is mainly because such methods can learn complex relations from data, especially when the underlying relations are unknown. Video summarization is an instance of such cases. The fact that different users prefer different summaries is strong evidence of the complexity of the problem. To overcome these impediments, one solution is to learn how to make good summaries in a supervised manner. The degree of supervision, however, differs in the literature. In [24, 25, 26, 27], weakly supervised web image and video priors help define visual importance, while captions associated with videos are used by [28, 29] to infer semantic importance. Finally, many frameworks (e.g., [11, 8, 9, 10, 5]) learn a summarizer directly from user-annotated summaries.
Sequence-to-Sequence Learning.
Sequence-to-sequence (seq2seq) modeling has been successfully employed in a vast set of applications, especially in natural language processing (NLP). Through the use of recurrent neural networks (RNNs), impressive modeling capabilities and results have been achieved in various fields such as machine translation [16] and text generation applications (e.g., image and video captioning [30, 31]). Seq2seq models are conveniently trained as conditional language models, maximizing the probability of observing the next ground-truth word conditioned on the input and target words. This translates to using merely a word-level loss (usually a simple cross-entropy over the vocabulary).

While the training procedure described above has been shown to be effective in various word-generation tasks, the learned models are not used as conditional models during inference at test time. Conventionally, a greedy approach is taken to generate the output sequence. Moreover, when evaluating, the complete output sequence is compared against the gold target sequence using a sequence-level evaluation metric such as ROUGE [32] or BLEU [33].
Determinantal point process (DPP).
A discrete DPP [13, 34] defines a distribution over all the subsets of a ground set, measuring the negative correlation, or repulsion, of the elements in each subset. Given a ground set $\mathcal{Y} = \{1, 2, \ldots, N\}$, one can define a positive semidefinite kernel matrix $K \in \mathbb{R}^{N \times N}$ that represents the per-element importance as well as the pairwise similarities between the elements. A distribution over a random subset $Y \subseteq \mathcal{Y}$ is a DPP if, for every $A \subseteq \mathcal{Y}$, the following holds:

$$P(A \subseteq Y) = \det(K_A), \qquad (1)$$

where $K_A$ is the squared sub-kernel of $K$ with rows and columns indexed by the elements in $A$, and $\det(\cdot)$ is the determinant function. $K$ is referred to as the marginal kernel since one can compute the probability of any subset $A$ being included in $Y$. It is the property of the determinant that promotes diversity: in order to have a high probability $P(\{i, j\} \subseteq Y)$, the per-element importance terms $K_{ii}$ and $K_{jj}$ must be high and meanwhile the pairwise similarity terms $K_{ij}$ must be low.
To directly specify the atomic probabilities for all the subsets of $\mathcal{Y}$, Borodin and Rains derived another form of DPPs through a positive semidefinite matrix $L$, the L-ensemble kernel [35], where $K = L(L + I)^{-1}$ and $I$ is an identity matrix. It samples a subset $y \subseteq \mathcal{Y}$ with probability

$$P(Y = y) = \frac{\det(L_y)}{\det(L + I)}, \qquad (2)$$

where the denominator $\det(L + I)$ is a normalization constant.
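As a quick numerical illustration of Eq. (2), the sketch below (NumPy; the toy kernel is invented for illustration) shows how the determinant rewards diverse subsets:

```python
import numpy as np

def dpp_prob(L, subset):
    """P(Y = subset) = det(L_subset) / det(L + I) for an L-ensemble DPP, Eq. (2)."""
    idx = np.asarray(subset, dtype=int)
    L_sub = L[np.ix_(idx, idx)]
    return np.linalg.det(L_sub) / np.linalg.det(L + np.eye(L.shape[0]))

# Toy kernel (invented): items 0 and 1 are near-duplicates, item 2 is distinct.
L = np.array([[1.0, 0.9, 0.0],
              [0.9, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

# The determinant rewards diversity: the dissimilar pair {0, 2} is more
# probable than the redundant pair {0, 1}.
assert dpp_prob(L, [0, 2]) > dpp_prob(L, [0, 1])
```

Summing `dpp_prob` over all $2^3$ subsets recovers 1, confirming that $\det(L + I)$ normalizes the distribution.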
Sequential DPP (SeqDPP).
Gong et al. proposed SeqDPP [36] to preserve partial orders of the elements in the ground set. Given a long sequence of elements (e.g., video shots), we divide them into $T$ disjoint yet consecutive partitions $\mathcal{Y} = \bigcup_{t=1}^{T} \mathcal{Y}_t$. The elements within each partition are orderless, so that DPP applies, and yet the orders among the partitions are observed in the following manner. At the $t$-th time step, SeqDPP selects a diverse subset of elements, represented by a variable $X_t$, from the corresponding partition $\mathcal{Y}_t$, conditioned on the elements $x_{t-1}$ selected from the previous partition. In particular, the distribution of the subset selection variable $X_t$ is given by a conditional DPP,

$$P(X_t = x_t \mid X_{t-1} = x_{t-1}) = P_{L^t}\big(Y = x_t \cup x_{t-1} \mid x_{t-1} \subseteq Y\big) \qquad (3)$$
$$= \frac{\det \Omega^t_{x_t}}{\det\big(\Omega^t + I_{\mathcal{Y}_t}\big)}, \qquad (4)$$

where $\Omega^t$ and $L^t$ are the kernels of two L-ensemble DPPs with the ground sets $\mathcal{Y}_t$ and $x_{t-1} \cup \mathcal{Y}_t$, respectively; namely, the conditional DPP itself is a valid DPP over the "shrunken" ground set $\mathcal{Y}_t$. The relationship between the two L-ensemble kernels is given by [35],

$$\Omega^t = \Big( \big[ (L^t + \hat{I}_{x_{t-1}})^{-1} \big]_{\mathcal{Y}_t} \Big)^{-1} - I_{\mathcal{Y}_t}, \qquad (5)$$

where $\hat{I}_{x_{t-1}}$ is an identity matrix of the same size as $L^t$ except that the diagonal entries corresponding to $x_{t-1}$ are 0's, $[\cdot]_{\mathcal{Y}_t}$ is the squared submatrix indexed by the elements in $\mathcal{Y}_t$, and the number of rows/columns of the last identity matrix $I_{\mathcal{Y}_t}$ equals the size of the $t$-th video segment $\mathcal{Y}_t$.
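The conditional kernel of Eq. (5) takes only a few lines to compute. Below is a NumPy sketch (the function name and index convention are ours, for illustration) that builds $\Omega$ from an L-ensemble kernel, the current partition, and the previously selected items:

```python
import numpy as np

def conditional_kernel(L, ground, selected):
    """Sketch of Eq. (5): Omega = ([(L + I_hat)^(-1)]_ground)^(-1) - I, where
    I_hat is an identity matrix with zeros on the diagonal entries of `selected`."""
    n = L.shape[0]
    I_hat = np.eye(n)
    for i in selected:
        I_hat[i, i] = 0.0              # zero out entries of the conditioned-on items
    inv = np.linalg.inv(L + I_hat)
    sub = inv[np.ix_(ground, ground)]  # restrict to the current partition
    return np.linalg.inv(sub) - np.eye(len(ground))
```

Given $\Omega$, the conditional probability of picking a subset $x$ of the partition is $\det(\Omega_x)/\det(\Omega + I)$, exactly as in Eq. (4).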
3 A large-margin algorithm for learning SeqDPPs
We present the main large-margin learning algorithm in this section. We first review the mismatch between the training and inference of SeqDPPs [8] and then describe the large-margin algorithm in detail.
Training and inference of SeqDPP.
For the application of supervised video summarization, SeqDPP is trained by maximizing the likelihood (MLE) of user summaries. At test time, however, an approximate online inference is employed:

$$x_1 = \operatorname*{argmax}_{x \subseteq \mathcal{Y}_1} P(X_1 = x), \qquad x_t = \operatorname*{argmax}_{x \subseteq \mathcal{Y}_t} P(X_t = x \mid X_{t-1} = x_{t-1}), \quad t = 2, \ldots, T. \qquad (6)$$
We note that, in the inference phase, a possible error at one time step propagates to the future, whereas MLE always feeds the oracle summary to SeqDPP in the training stage (i.e., exposure bias [19]). Besides, the likelihood-based objective function used in training does not necessarily correlate well with the evaluation metrics of the test stage (i.e., loss-evaluation mismatch [19]).
The issues above are common in seq2seq learning. It has been shown that improved results can be achieved if one tackles them explicitly [37, 38, 39, 19, 40]. Motivated by these findings, we propose a large-margin algorithm for SeqDPP to mitigate the exposure bias and loss-evaluation mismatch issues in existing SeqDPP works. Our algorithm mainly borrows some ideas from [20], which studies the large-margin principle in training recurrent neural networks. However, we are not constrained by the beam search, do not need to change the probabilistic SeqDPP model to any non-probabilistic version, and also fit a test-time evaluation metric into the large-margin formulation.
We now design a loss function

$$\mathcal{L}(\theta) = \sum_{t} r\big(x_{1:t}, y_{1:t}\big)\, \ell_t(\theta), \qquad (7)$$

which includes two components: 1) a sequence-level cost $r(\cdot, \cdot)$, which allows us to scale the loss depending on how erroneous the test-time inference is compared to the oracle summary, and 2) a margin-sensitive loss term $\ell_t$, which penalizes the situation when the probability of an oracle sequence fails to exceed the probability of the model-inferred ones by a margin. Denote by $x_t$ and $y_t$ the subsets selected from the $t$-th partition by SeqDPP and by an "oracle" user, respectively, and let $y_{1:t}$ represent the oracle summary until time step $t$. The sequence-level cost $r(\cdot, \cdot)$ can be any metric (e.g., one minus the F1-score) used to contrast a system-generated summary with a user summary.
Assuming SeqDPP is able to choose the right subset $y_{t-1}$ from partition $\mathcal{Y}_{t-1}$, given the next partition $\mathcal{Y}_t$, the margin-sensitive loss penalizes the situation that the model selects a subset $x_t$ different from the oracle $y_t$,

$$\ell_t(\theta) = \Big[ 1 + \log P\big(X_t = x_t \mid y_{t-1}\big) - \log P\big(X_t = y_t \mid y_{t-1}\big) \Big]_+, \qquad (8)$$

where $[a]_+ = \max(0, a)$. When we use this loss term in training SeqDPP, we always assume that the correct subset $y_{t-1}$ is chosen at the previous time step $t-1$. In other words, we penalize the model step by step instead of checking the whole sequence of subsets predicted by the model. This allows more effective training by 1) forcing the model to choose correct subsets at every time step, and 2) enabling us to set the gradient weights according to how erroneous a mistake at a given time step actually is in the eyes of the evaluation metric.
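As a sketch, the per-step margin-sensitive term, scaled by the sequence-level cost as described above, might look as follows (the function names and the unit margin are illustrative; `log_p_model` and `log_p_oracle` stand for SeqDPP conditional log-probabilities):

```python
def hinge(a):
    """[a]_+ = max(0, a)."""
    return max(0.0, a)

def margin_loss(log_p_model, log_p_oracle, cost, margin=1.0):
    """Penalize the model when the oracle subset does not beat the
    model-inferred subset by at least `margin` in log-probability,
    scaled by a sequence-level cost (which is 0 when the two summaries
    are equivalent under the evaluation metric)."""
    return cost * hinge(margin + log_p_model - log_p_oracle)
```

Note that when the cost is zero, i.e., the inferred and oracle summaries are equivalent under the evaluation metric, the loss vanishes and no parameter update occurs, regardless of the margin.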
Compared to MLE, it is especially appealing that the largemargin formulation flexibly takes the evaluation metric into account. As a result, it does not require SeqDPP to predict exactly the same summaries as the oracles. Instead, when the predicted and oracle summaries are equivalent (not necessarily identical) according to the evaluation metric, the model parameters are not updated.
4 Disentangling size and content in SeqDPP
In this section, we propose a sequential model of generalized DPPs (SeqGDPP) that accepts an arbitrary distribution over the sizes of the subsets, whose contents follow DPP distributions. It allows users to provide priors or constraints over the total number of items to be selected. We first present the generalized DPP and then describe how to use it to devise the sequential model, SeqGDPP.
4.1 Generalized DPPs (GDPPs)
Kulesza and Taskar have made an intriguing observation about the vanilla DPP: it conflates the size and content of the variable $Y$ for selecting subsets from the ground set [21]. To see this point more clearly, we can rewrite a DPP as a mixture of elementary DPPs [41, Lemma 2.6],

$$P_L(Y) = \frac{1}{\det(L + I)} \sum_{J \subseteq \{1, \ldots, N\}} P^{V_J}(Y) \prod_{n \in J} \lambda_n \qquad (9)$$
$$= \frac{1}{\det(L + I)} \sum_{k=0}^{N} \; \sum_{|J| = k} P^{V_J}(Y) \prod_{n \in J} \lambda_n, \qquad (10)$$

where the first summation in Eq. (10) is over all the possible sizes of the subsets and the second is over the particular items of each subset.
Eigen-decomposing the L-ensemble kernel to $L = \sum_{n=1}^{N} \lambda_n v_n v_n^\top$, the marginal kernel of the elementary DPP $P^{V_J}$ is $K^{V_J} = \sum_{n \in J} v_n v_n^\top$; it is interesting to note that, due to this form of the marginal kernel, the elementary DPPs do not have counterpart L-ensembles. The elementary DPP $P^{V_J}$ always chooses $|J|$ items from the ground set $\mathcal{Y}$, namely, $P^{V_J}(|Y| = |J|) = 1$.
Eq. (10) indicates that, to sample from the vanilla DPP, one may first sample the size of a subset (implicitly, with weights given by the inner summation) and then draw the items/content of the subset. We propose to perturb this process and explicitly impose a distribution $\pi = \{\pi_k\}_{k=0}^{N}$ over the sizes of the subsets,

$$P_G(Y) \propto \sum_{k=0}^{N} \pi_k \sum_{|J| = k} P^{V_J}(Y) \prod_{n \in J} \lambda_n. \qquad (11)$$

As a result, the generalized DPP (GDPP) entails both the DPP and the kDPP [21] as special cases (when $\pi$ is uniform and when $\pi$ is a Dirac delta distribution, respectively), offering a larger expressive spectrum. Another interesting case is a truncated uniform distribution over the sizes of the subsets. In this case, we arrive at a DPP that selects subsets with bounded cardinality. Such a constraint arises in real applications like document summarization, image display, and sensor placement.
Normalization.

The normalization constant for GDPP is $Z = \sum_{k=0}^{N} \pi_k \sum_{|J| = k} \prod_{n \in J} \lambda_n = \sum_{k=0}^{N} \pi_k\, e_k(\lambda)$, where $e_k(\lambda)$ is the $k$-th elementary symmetric polynomial of the eigenvalues $\lambda_1, \ldots, \lambda_N$ of $L$. Details are included in the supplementary materials (Suppl.). The computational complexity of this normalization depends on the eigen-decomposition of $L$. With the eigenvalues $\lambda_n$ at hand, we can compute the constant in polynomial time with a slight change to the recursive algorithm [41, Algorithm 7], which calculates all the elementary symmetric polynomials $e_1(\lambda), \ldots, e_N(\lambda)$ in $O(N^2)$ time. Therefore, the overall complexity of computing the normalization constant for GDPP is about the same as the complexity of normalizing an L-ensemble DPP (i.e., computing $\det(L + I)$).

Evaluation.
With the normalization constant $Z$, we are ready to write out the probability of selecting a particular subset $y$ from the ground set $\mathcal{Y}$ by GDPP,

$$P_G(Y = y) = \frac{1}{Z}\, \pi_{|y|} \sum_{|J| = |y|} P^{V_J}(y) \prod_{n \in J} \lambda_n, \qquad (12)$$

in which the concise form is due to the property of the elementary DPPs that $P^{V_J}(y) = 0$ when $|J| \ne |y|$.
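The elementary symmetric polynomials behind the normalization constant $Z$ admit the standard $O(N^2)$ recursion mentioned above. A small NumPy sketch (function names are ours, for illustration):

```python
import numpy as np

def elem_sym_polys(lams):
    """e_k(lam_1..lam_N) for k = 0..N via the O(N^2) recursion
    e_k <- e_k + lam_n * e_{k-1}, one eigenvalue at a time."""
    lams = np.asarray(lams, dtype=float)
    N = len(lams)
    e = np.zeros(N + 1)
    e[0] = 1.0
    for lam in lams:
        # update in reverse so e[k-1] still holds the previous iteration's value
        for k in range(N, 0, -1):
            e[k] += lam * e[k - 1]
    return e

def gdpp_normalizer(eigvals, pi):
    """Z = sum_k pi_k * e_k(lambda): the GDPP normalization with size prior pi."""
    return float(np.dot(pi, elem_sym_polys(eigvals)))
```

As a sanity check, an all-ones weighting over the sizes recovers the vanilla L-ensemble normalizer, since $\sum_k e_k(\lambda) = \prod_n (1 + \lambda_n) = \det(L + I)$.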
GDPP as a mixture of kDPPs.

The GDPP expressed above has a close connection to the kDPPs [21]. This is not surprising given the definition of GDPP (cf. Eq. (11)). Indeed, GDPP can be exactly interpreted as a mixture of kDPPs $P_L^k$, $k = 0, \ldots, N$, if all the kDPPs, i.e., the mixture components, share the same L-ensemble kernel $L$ as the GDPP. If we introduce a new notation for the mixture weights, $w_k \propto \pi_k\, e_k(\lambda)$ with $\sum_k w_k = 1$, the GDPP can then be written as

$$P_G(Y) = \sum_{k=0}^{N} w_k\, P_L^k(Y). \qquad (13)$$

Moreover, there is no necessity to adhere to the involved expression of $w_k$. Under some scenarios, directly parameterizing $w_k$ may significantly ease the learning process. We will build a sequential model upon the GDPP of form (13) in the next section.
Exact sampling.

Following the interpretation of GDPP as a weighted combination of kDPPs, we have the following decomposition of the probability:

$$P_G(Y = y) = \sum_{k=0}^{N} P(|Y| = k)\, P(Y = y \mid |Y| = k),$$

where, with a slight abuse of notation, we let $P(|Y| = k) = w_k$ denote the probability of sampling a kDPP from the GDPP. Therefore, we can employ a two-phase sampling procedure from the GDPP:

1. Sample $k$ from the discrete distribution $\{w_k\}_{k=0}^{N}$.
2. Sample $y$ from the kDPP $P_L^k$.
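For a tiny ground set, the two-phase procedure can be sketched by brute force (a NumPy illustration, not the efficient sampler; in practice phase 2 would call a proper kDPP sampler):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

def sample_gdpp(L, pi):
    """Two-phase GDPP sampling (brute force, for tiny ground sets).
    `pi` is a length-(N+1) array of size-prior weights.
    1) Draw a size k with weight pi_k * e_k(lambda).
    2) Draw a subset of size k with probability ~ det(L_S) (a kDPP)."""
    N = L.shape[0]
    lams = np.linalg.eigvalsh(L)
    # elementary symmetric polynomials via the standard recursion
    e = np.zeros(N + 1)
    e[0] = 1.0
    for lam in lams:
        for k in range(N, 0, -1):
            e[k] += lam * e[k - 1]
    w = pi * e                                   # unnormalized mixture weights over sizes
    k = rng.choice(N + 1, p=w / w.sum())         # phase 1: sample a size
    subsets = list(itertools.combinations(range(N), k))
    dets = np.array([np.linalg.det(L[np.ix_(S, S)]) for S in subsets])
    return subsets[rng.choice(len(subsets), p=dets / dets.sum())]  # phase 2: kDPP
```

With a Dirac size prior (all mass on one $k$), every draw has exactly $k$ items, matching the kDPP special case noted above.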
4.2 A sequential model of GDPPs (SeqGDPP)
In this section, we construct a sequential model of generalized DPPs (SeqGDPP) such that it not only models the temporal and diverse properties as SeqDPP does, but also allows users to specify a prior or constraint over the length of the video summary.
We partition a long video sequence into $T$ disjoint yet consecutive short segments $\bigcup_{t=1}^{T} \mathcal{Y}_t$. The main idea of SeqGDPP is to adaptively distribute the expected length of the video summary to the different video segments, over each of which a GDPP is defined. In particular, we replace the conditional DPPs in SeqDPP (cf. Eq. (4)) by GDPPs,

$$P(X_t = x_t \mid X_{t-1} = x_{t-1}) = \mathrm{GDPP}\big(x_t;\, \Omega^t, \pi^t\big) \qquad (14)$$
$$= \sum_{k=0}^{|\mathcal{Y}_t|} w_k^t\, P_{\Omega^t}^{k}(x_t), \qquad (15)$$

where the last equality follows Eq. (13), and recall that the L-ensemble kernel $\Omega^t$ encodes the dependencies on the video frames/shots $x_{t-1}$ selected from the immediately preceding segment (cf. Section 2, Eq. (5)). The discrete distribution $\pi^t = \{\pi_k^t\}$ is over all the possible sizes of the subsets at time step $t$.
We update $\pi^t$ adaptively according to

$$\pi_k^t \propto \exp\big( -\beta\, (k - \mu_t)^2 \big), \qquad (16)$$

where the mean $\mu_t$ is our belief about how many items should be selected from the current video segment and the concentration factor $\beta$ tunes the confidence of that belief. When $\beta$ approaches infinity, the GDPP degenerates to a kDPP and chooses exactly $\mu_t$ items into the video summary.
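A small NumPy sketch of a size prior of this shape (a concentration-controlled, Gaussian-shaped weighting over candidate sizes; the function name is ours):

```python
import numpy as np

def size_prior(mu, beta, k_max):
    """pi_k proportional to exp(-beta * (k - mu)^2) over k = 0..k_max:
    the belief mu sets the most likely subset size, while beta sharpens
    (large beta) or flattens (small beta) the distribution."""
    k = np.arange(k_max + 1)
    p = np.exp(-beta * (k - mu) ** 2)
    return p / p.sum()
```

With `beta = 0` the prior is uniform (recovering the vanilla DPP's indifference to size), while a very large `beta` concentrates all mass near `mu`, approaching the kDPP behavior.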
Our intuition for parameterizing the mean $\mu_t$ encompasses three pieces of information: the expected length $L$ of the overall video summary, the number of items that have been selected into the summary up to the $t$-th time step, and the variety of the visual content in the current video segment $\mathcal{Y}_t$. Specifically,

$$\mu_t = \frac{1}{T - t + 1} \Big( L - \sum_{t' < t} |x_{t'}| \Big) + v^\top z_t, \qquad (17)$$

where the first term is the average number of items to be selected from each of the remaining video segments to make up an overall summary of length $L$, the second term moves this average up or down depending on the current video segment, and $z_t$ is a feature vector extracted from the segment. We learn $v$ from the training data, i.e., user-annotated video summaries and their underlying videos. We expect that a visually homogeneous video segment gives rise to a negative $v^\top z_t$, such that fewer than the average number of items will be selected from it, and vice versa.

4.3 Learning and inference
For the purpose of out-of-sample extension, we parameterize SeqGDPP in such a way that, at time step $t$, it conditions on the corresponding video segment $\mathcal{Y}_t$ and the shots selected at the immediately preceding time step. We use a simple convex combination [21] of base GDPPs, whose kernels are predefined over the video, for the parameterization. Concretely, at each time step $t$,

$$P(X_t = x_t \mid X_{t-1} = x_{t-1}) = \sum_i \alpha_i\, \mathrm{GDPP}\big(x_t;\, \Omega_i^t, \pi^t\big), \qquad (18)$$

where the L-ensemble kernels $\Omega_i^t$ of the base GDPPs are derived from the corresponding kernels of the conditional DPPs (Eq. (5)). We compute different Gaussian RBF kernels for $\Omega_i^t$ from the segment $\mathcal{Y}_t$ and the previously selected subset $x_{t-1}$ by varying the bandwidths. The combination coefficients ($\alpha_i \ge 0$, $\sum_i \alpha_i = 1$) are learned from the training videos and summaries.

Consider a single training video and its user summary $\{y_t\}_{t=1}^{T}$ for convenience of presentation. We learn SeqGDPP by maximizing the log-likelihood,

$$\max \sum_{t=1}^{T} \log P\big(X_t = y_t \mid X_{t-1} = y_{t-1}\big).$$
5 Experimental Setup and Results
In this section, we provide details on compiling an egocentric video summarization dataset, the annotation process, and the employed evaluation procedure.
Dataset.
While various video summarization datasets exist [42, 28, 43], we put consumer-grade egocentric videos as our priority. Due to their lengthy nature, they carry a high level of redundancy, making summarization a vital and challenging problem. The UT Egocentric dataset [22] includes 4 videos, each between 3 and 5 hours long, covering activities such as driving, shopping, and studying in an uncontrolled environment. However, we find this dataset insufficient for supervised video summarization; hence, we significantly extend it by adding another 8 egocentric videos (averaging over 6 hours each) from the social interactions dataset [23]. These videos were recorded using head-mounted cameras worn by individuals during their visit to Disney parks. Our efforts result in a dataset consisting of 12 long videos with a total of over 60 hours of video content.
User Summary Collection.
Having compiled a set of 12 egocentric videos, we recruit three students to summarize the videos. The only instruction we give them is to operate at the 5-second video shot level; namely, a full shot is selected into the summary once any frame in the shot is chosen. Without any further constraints, the participants can thus use their own granularities and preferences to summarize the videos. Table 1 shows that users have their own distinct preferences about the summary lengths.
Table 1: Lengths (numbers of shots) of the three user summaries and the oracle summaries.

        User 1          User 2          User 3          Oracle
Min     79              74              45              74
Max     174             222             352             200
Avg.    105.75±27.21    133.33±54.04    177.92±90.96    135.92±45.99
Oracle Summaries.
Supervised video summarization approaches are conventionally trained on one target summary per video. Having obtained 3 user summaries per video, we aggregate them into one oracle summary per video using a greedy algorithm that has been used in several previous works [8, 9, 10], and train the models on the oracle summaries. We leave the details of the algorithm to the supplementary materials.
Features.
We follow Zhang et al. [11] in extracting features using a pre-trained GoogLeNet [44], taking the output of the pool5 layer, which results in a 1024-d feature representation for each shot in the video.
Evaluation.
There has been a plethora of approaches for evaluating the quality of video summaries, including user studies [45, 46], low-level or pixel-level measurements comparing system summaries against human summaries [8, 24, 25, 47, 4], and temporal-overlap-based metrics defined between two summaries [42, 5, 7, 11]. We share the same opinion as [48, 9, 10] in evaluating the summaries using high-level semantic information.
For measuring the quality of system summaries, Sharghi et al. [10] proposed to obtain dense shot-level concept annotations and convert them to binary semantic vectors, where 1's and 0's indicate the presence or absence of visual concepts such as Sky, Car, and Tree for that specific shot. It is then straightforward to measure the similarity between two shots using the intersection-over-union (IoU) of their corresponding tags. For instance, if one shot is tagged by {Street, Tree, Sun} and the other by {Lady, Car, Street, Tree}, then the IoU is 2/5 = 0.4. Having defined the similarity measure between shots, one can conveniently perform maximum-weight matching on the bipartite graph where the user and system summaries are placed on the two opposing sides.
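The tag-based IoU takes one line to compute (a sketch; the tag names follow the example above):

```python
def shot_iou(tags_a, tags_b):
    """Intersection-over-union of two shots' concept tag sets."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / len(a | b)

# Example from the text: {Street, Tree} shared out of 5 distinct concepts.
iou = shot_iou({"Street", "Tree", "Sun"}, {"Lady", "Car", "Street", "Tree"})
assert abs(iou - 0.4) < 1e-12
```

These pairwise similarities then become the edge weights of the bipartite graph on which the maximum-weight matching is computed (e.g., via the Hungarian algorithm).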
To collect shot-level concept annotations, we start with the dictionary of [10] and remove the concepts that do not appear often enough, such as Boat and Ocean. Furthermore, we apply SentiBank detectors [49] (with over 1,400 pre-trained classifiers) on the frames of the videos to make a list of visual concepts appearing commonly throughout the dataset. Next, by watching the videos, we select the top candidates from this list and append them to the final dictionary, which includes 54 concepts. These steps are necessary as our dataset contains over 3 times the video content of [10]. Figure 2 illustrates the appearance counts of the visual concepts throughout our dataset.

Having constructed the dictionary of concepts, we uniformly sample 5 frames from each shot and ask Amazon Mechanical Turk workers to tag them with the concepts. The instruction is that a concept must be selected if it appears in any of the 5 frames. We hire 3 Turkers per shot and pool their annotations by taking the union. On average, each shot is tagged with 11 concepts. This is significantly larger than the average of 4 tags/shot in Sharghi et al. [10], resulting in a more reliable assessment upon evaluation.
While the metric introduced in [10] compares summaries using high-level concept similarities, it allows a shot in the system summary to be matched with any shot in the user summary without any temporal restriction. This causes at least two problems. First, an important shot in the gold summary may be matched to a visually similar shot that happened long before or after it. Second, since the shot similarities are necessarily positive, matching weakly similar shots that are temporally far apart falsely increases the matching score. To fix these two issues, we modify the metric by applying a temporal filter to the measured similarities. We use two types of filters: 1) a box (a.k.a. rectangular) function and 2) a Gaussian function. The box filter sets the similarities outside of a time range to zero, hence forcing the metric to match a shot only to its temporally close candidates. The Gaussian filter, on the other hand, applies a decaying factor to farther matches.
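Both temporal filters can be sketched in a few lines (NumPy; the function names and the Gaussian parameterization are illustrative):

```python
import numpy as np

def box_filter(sim, t_sys, t_user, window):
    """Zero out similarities between shots farther apart in time than `window`."""
    return sim if abs(t_sys - t_user) <= window else 0.0

def gaussian_filter(sim, t_sys, t_user, bandwidth):
    """Decay similarities with temporal distance instead of hard-cutting them."""
    return sim * np.exp(-((t_sys - t_user) ** 2) / (2.0 * bandwidth ** 2))
```

The filtered similarities replace the raw IoU values as edge weights before the bipartite matching is run, so distant matches either vanish (box) or count for less (Gaussian).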
To evaluate a summary, we compare it to all 3 user-annotated summaries and average the scores. We report the performance while varying the corresponding filter parameters, i.e., the temporal window size of the box filter and the bandwidth of the Gaussian filter, as illustrated in Figure 1. In addition, we compute the Area-Under-the-Curve (AUC) of the average F1-scores in Table 2. It is worth mentioning that setting the parameters of the filters to infinity results in the same metric defined by Sharghi et al. [10]; our metric is thus a generalization of the latter.
Data split.
In order to have a comprehensive assessment of the models, we employ a leave-one-out strategy. We thus run 12 sets of experiments, each time leaving one video out for testing, two for validation (to tune hyper-parameters), and the remaining 9 for training. We report the average performance over all 12 videos later in this section.
Large-Margin Training/Inference.
Similar to common practice in seq2seq learning [20, 19], we accelerate training by pre-training with the standard sequential models, i.e., maximizing the likelihood of user summaries using SGD. This serves as a good initialization and results in a faster training process. At test time, we follow Eq. (6) to generate the system summary.
SeqGDPP Details.
Given the features extracted using GoogLeNet, we compute Gaussian RBF kernels over the video shots by varying the bandwidths, set as multiples of the median of all pairwise distances between the video shots. The base kernels for the GDPPs are then computed through Eq. (5), such that they take into account the dependency between two adjacent time steps.
We also need to extract a feature vector $z_t$ to capture the information in each video segment $t$. In Eq. (17), we use this feature vector to help fine-tune the mean $\mu_t$ of the distribution $\pi^t$ over the possible subset sizes. Intuitively, larger subsets should be selected from segments with more frequent visual appearance changes. As such, we compute $z_t$ as the standard deviation per feature dimension within the segment.

There are three sets of parameters in SeqGDPP: $v$ and $\beta$ in the distribution $\pi^t$ over the subset sizes, and the coefficients $\{\alpha_i\}$ for the convex combination of the base GDPPs. We maximize the log-likelihood using gradient descent to solve for $v$ and $\{\alpha_i\}$, and cross-validate $\beta$.
Query-Focused Video Summarization.
As noted by Sharghi et al. [9], due to the subjectivity of video summarization, it is appealing to personalize the summary based on a user's preferences. Hence, in query-focused summarization, deciding whether to include a video shot in the summary depends jointly on the shot's relevance to a query term (that comes from the user) and its importance in the context of the video. Sharghi et al. [10] made available a collection of 184 (video, query) pairs. To further assess our models, we compare them to the state-of-the-art query-focused video summarization frameworks in the supplementary material.
Table 2: AUCs of the average F1-score curves under the two temporal filters.

                     Box      Gaussian
Uniform              12.33    12.36
SubMod [5]           11.20    11.12
SuperFrames [42]     11.46    11.28
LSTM-DPP [11]         7.38     7.36
SeqDPP [8]            9.71     9.56
LM-SeqDPP            15.05    14.69
SeqGDPP              15.29    14.86
LM-SeqGDPP           15.87    15.43
5.1 Quantitative Results and Analyses
In this section, we report quantitative results comparing our proposed models against various baselines:
– Uniform. As the name suggests, we sample shots with fixed step size from the video such that the generated summary has equal length (same number of shots) as the oracle summary.
– SubMod. Gygli et al. [5] learn a convex combination of interestingness, representativeness, and uniformity from user summaries in a supervised manner. At test time, given the expected summary length, i.e., the length of the oracle summary, the model generates the summary.
– SuperFrames. In [42], Gygli et al. first segment the video into superframes and then measure their individual importance scores. Given the scores, the subsets that achieve the highest accumulated scores are considered the desired summary. Since a shot is 5 seconds long in our dataset, we skip the superframe segmentation component. We train a neural network consisting of three fully-connected layers to measure each shot's importance score, and then choose the subsets with the highest accumulated scores as the summary.
– LSTM-DPP. In [11], Zhang et al. exploit LSTMs to model the temporal dependency between the shots of the video, and further use DPPs to enforce diversity in selecting important shots. Similar to the previous baselines, this model has access to the expected summary length at test time.
– SeqDPP. This is the original framework of Gong et al. [8]. Unlike other baselines, this model determines the summary length automatically.
1) Comparing SeqDPP and large-margin SeqDPP (denoted by LM-SeqDPP), we observe a significant performance boost. As illustrated in Figure 1, the performance gap is consistently large across different filter parameters. Although both SeqDPP and LM-SeqDPP determine the summary length automatically, our inspection shows that the latter makes summaries that resemble the oracle summaries in terms of both length and conveyed semantic information.
2) Comparing SeqGDPP to SeqDPP, for which users cannot tune the expected length of the summary, we can see that SeqGDPP significantly outperforms SeqDPP. This is not surprising since SeqDPP has no mechanism to take the user-supplied summary length into account. As a result, the number of shots selected by SeqDPP is sometimes much smaller or larger than the length of the user summary.
3) Large-margin SeqGDPP (LM-SeqGDPP) performs slightly better than SeqGDPP, and it outperforms all other models. As both models generate system summaries of length equal to the oracle's, the large-margin formulation helps make better summaries by optimizing for the evaluation metric.
4) As described earlier, our refined evaluation scheme is a generalization of the BM metric of [10]; by setting the filter parameters to infinity (hence no temporal restriction enforced by the filters), we obtain the performance under the BM metric, represented by the last points of the curves in Figure 1. While the performance under our refined metric differs significantly from model to model, under the BM metric the models perform almost the same. This is due to the problems of the unrestricted metric discussed earlier in Section 5.
6 Conclusion
In this work, we made a twofold contribution toward improving sequential determinantal point process-based models for supervised video summarization. We proposed a large-margin training scheme that facilitates learning more effectively by addressing two problems common to most seq2seq frameworks: exposure bias and loss-evaluation mismatch. On the modeling side, we introduced a new probabilistic block, the GDPP, which, when integrated into SeqDPP, yields a model that can accept priors about the expected summary length. Furthermore, we compiled a large video summarization dataset consisting of 12 egocentric videos totaling over 60 hours of content, and collected 3 user-annotated summaries per video as well as the dense concept annotations required for evaluation. Finally, we conducted experiments on this dataset to verify the effectiveness of the proposed models.
References
 [1] Obile, W.: Ericsson mobility report (2016)
 [2] Hirsch, R.: Seizing the Light: A Social & Aesthetic History of Photography. Taylor & Francis (2017)
 [3] Wolf, W.: Key frame selection by motion analysis. In: Acoustics, Speech, and Signal Processing, 1996. ICASSP-96. Conference Proceedings., 1996 IEEE International Conference on. Volume 2., IEEE (1996) 1228–1231
 [4] Zhao, B., Xing, E.P.: Quasi real-time summarization for consumer videos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 2513–2520
 [5] Gygli, M., Grabner, H., Van Gool, L.: Video summarization by learning submodular mixtures of objectives. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3090–3098
 [6] Sun, M., Farhadi, A., Seitz, S.: Ranking domain-specific highlights by analyzing edited videos. In: European conference on computer vision, Springer (2014) 787–802
 [7] Potapov, D., Douze, M., Harchaoui, Z., Schmid, C.: Category-specific video summarization. In: European conference on computer vision, Springer (2014) 540–555
 [8] Gong, B., Chao, W.L., Grauman, K., Sha, F.: Diverse sequential subset selection for supervised video summarization. In: Advances in Neural Information Processing Systems. (2014) 2069–2077
 [9] Sharghi, A., Gong, B., Shah, M.: Query-focused extractive video summarization. In: European Conference on Computer Vision, Springer (2016) 3–19
 [10] Sharghi, A., Laurel, J.S., Gong, B.: Query-focused video summarization: Dataset, evaluation, and a memory network based approach. arXiv preprint arXiv:1707.04960 (2017)
 [11] Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: European Conference on Computer Vision, Springer (2016) 766–782
 [12] Sadeghian, A., Sundaram, L., Wang, D.Z., Hamilton, W.F., Branting, K., Pfeifer, C.: Automatic semantic edge labeling over legal citation graphs. Artificial Intelligence and Law 26(2) (2018) 127–144
 [13] Kulesza, A., Taskar, B., et al.: Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning 5(2–3) (2012) 123–286
 [14] Sutskever, I., Martens, J., Hinton, G.E.: Generating text with recurrent neural networks. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). (2011) 1017–1024
 [15] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in neural information processing systems. (2014) 3104–3112
 [16] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
 [17] Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I., Hinton, G.: Grammar as a foreign language. In: Advances in Neural Information Processing Systems. (2015) 2773–2781
 [18] Serban, I.V., Sordoni, A., Bengio, Y., Courville, A.C., Pineau, J.: Building end-to-end dialogue systems using generative hierarchical neural network models. In: AAAI. (2016) 3776–3784
 [19] Ranzato, M., Chopra, S., Auli, M., Zaremba, W.: Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732 (2015)
 [20] Wiseman, S., Rush, A.M.: Sequence-to-sequence learning as beam-search optimization. arXiv preprint arXiv:1606.02960 (2016)
 [21] Kulesza, A., Taskar, B.: k-DPPs: Fixed-size determinantal point processes. In: Proceedings of the 28th International Conference on Machine Learning (ICML). (2011) 1193–1200
 [22] Lee, Y.J., Ghosh, J., Grauman, K.: Discovering important people and objects for egocentric video summarization. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 1346–1353
 [23] Fathi, A., Hodgins, J.K., Rehg, J.M.: Social interactions: A first-person perspective. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, IEEE (2012) 1226–1233
 [24] Khosla, A., Hamid, R., Lin, C.J., Sundaresan, N.: Large-scale video summarization using web-image priors. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2013) 2698–2705
 [25] Kim, G., Sigal, L., Xing, E.P.: Joint summarization of largescale collections of web images and videos for storyline reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2014) 4225–4232
 [26] Xiong, B., Grauman, K.: Detecting snap points in egocentric video with a web photo prior. In: European Conference on Computer Vision, Springer (2014) 282–298
 [27] Chu, W.S., Song, Y., Jaimes, A.: Video co-summarization: Video summarization by visual co-occurrence. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3584–3592
 [28] Song, Y., Vallmitjana, J., Stent, A., Jaimes, A.: TVSum: Summarizing web videos using titles. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 5179–5187
 [29] Liu, W., Mei, T., Zhang, Y., Che, C., Luo, J.: Multitask deep visualsemantic embedding for video thumbnail selection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3707–3715
 [30] Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence – video to text. In: Proceedings of the IEEE international conference on computer vision. (2015) 4534–4542
 [31] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning. (2015) 2048–2057
 [32] Lin, C.Y.: Rouge: A package for automatic evaluation of summaries. In: Text summarization branches out: Proceedings of the ACL04 workshop. Volume 8., Barcelona, Spain (2004)
 [33] Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th annual meeting on association for computational linguistics, Association for Computational Linguistics (2002) 311–318
 [34] Hough, J.B., Krishnapur, M., Peres, Y., Virág, B., et al.: Determinantal processes and independence. Probability surveys 3 (2006) 206–229
 [35] Borodin, A., Rains, E.M.: Eynard–mehta theorem, schur process, and their pfaffian analogs. Journal of statistical physics 121(3) (2005) 291–317
 [36] Gong, B., Chao, W., Grauman, K., Sha, F.: Diverse sequential subset selection for supervised video summarization. In: Advances in Neural Information Processing Systems (NIPS). (2014) 2069–2077
 [37] Daumé, H., Langford, J., Marcu, D.: Search-based structured prediction. Machine learning 75(3) (2009) 297–325
 [38] Ross, S., Gordon, G.J., Bagnell, D.: A reduction of imitation learning and structured prediction to no-regret online learning. In: International Conference on Artificial Intelligence and Statistics. (2011) 627–635
 [39] Collins, M., Roark, B.: Incremental parsing with the perceptron algorithm. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics, Association for Computational Linguistics (2004) 111
 [40] Shen, S., Cheng, Y., He, Z., He, W., Wu, H., Sun, M., Liu, Y.: Minimum risk training for neural machine translation. arXiv preprint arXiv:1512.02433 (2015)
 [41] Kulesza, A., Taskar, B.: Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning 5(2–3) (2012) 123–286
 [42] Gygli, M., Grabner, H., Riemenschneider, H., Van Gool, L.: Creating summaries from user videos. In: European conference on computer vision, Springer (2014) 505–520
 [43] De Avila, S.E.F., Lopes, A.P.B., da Luz Jr, A., de Albuquerque Araújo, A.: Vsumm: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters 32(1) (2011) 56–68
 [44] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. (2015) 1–9
 [45] Lee, Y.J., Grauman, K.: Predicting important objects for egocentric video summarization. International Journal of Computer Vision 114(1) (2015) 38–55
 [46] Lu, Z., Grauman, K.: Story-driven summarization for egocentric video. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2013) 2714–2721
 [47] Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Summary transfer: Exemplar-based subset selection for video summarization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2016) 1059–1067
 [48] Yeung, S., Fathi, A., Fei-Fei, L.: VideoSET: Video summary evaluation through text. arXiv preprint arXiv:1406.5824 (2014)
 [49] Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S.F.: Largescale visual sentiment ontology and detectors using adjective noun pairs. In: Proceedings of the 21st ACM international conference on Multimedia, ACM (2013) 223–232