Large-Margin Determinantal Point Processes

11/06/2014 ∙ by Boqing Gong, et al. ∙ University of Southern California ∙ The University of Texas at Austin

Determinantal point processes (DPPs) offer a powerful approach to modeling diversity in many applications where the goal is to select a diverse subset. We study the problem of learning the parameters (the kernel matrix) of a DPP from labeled training data. We make two contributions. First, we show how to reparameterize a DPP's kernel matrix with multiple kernel functions, thus enhancing modeling flexibility. Second, we propose a novel parameter estimation technique based on the principle of large margin separation. In contrast to the state-of-the-art method of maximum likelihood estimation, our large-margin loss function explicitly models errors in selecting the target subsets, and it can be customized to trade off different types of errors (precision vs. recall). Extensive empirical studies validate our contributions, including applications on challenging document and video summarization, where flexibility in modeling the kernel matrix and balancing different errors is indispensable.




1 Introduction

Imagine we are to design a search engine to retrieve web images that match user queries. In response to the search term jaguar, what should we retrieve — the images of the animal jaguar or the images of the automobile jaguar?

This frequently cited example illustrates the need to incorporate the notion of diversity. In many tasks, we want to select a subset of items from a “ground set”. While the ground set might contain many similar items, our goal is not to discover all of the same ones, but rather to find a subset of diverse items that ensure coverage (the exact definition of coverage is task-specific). In the example of retrieving images for jaguar, we achieve diversity by including both types of images.

Recently, the determinantal point process (DPP) has emerged as a promising technique for modeling diversity [1]. A DPP defines a probability distribution over the power set of a ground set. Intuitively, subsets of higher diversity are assigned larger probabilities, and thus are more likely to be selected than those with lower diversity. Since its original application to quantum physics, the DPP has found many applications in modeling random trees and graphs [2], document summarization [3], search and ranking in information retrieval [4], and clustering [5]. Various extensions have also been studied, including k-DPP [4], structured DPP [6], Markov DPP [7], and DPP on continuous spaces [8].

The probability distribution of a DPP depends crucially on its kernel — a square and symmetric, positive semidefinite matrix whose elements specify how similar every pair of items in the ground set are. This kernel matrix is often unknown and needs to be estimated from training data.

This is a very challenging problem for several reasons. First, the number of the parameters, i.e., the number of elements in the kernel matrix, is quadratic in the number of items in the ground set. For many tasks (for instance, image search), the ground set can be very large. Thus it is impractical to directly specify every element of the matrix, and a suitable reparameterization of the matrix is necessary. Secondly, the number of training samples is often limited in many practical applications. One such example is the task of document summarization, where our aim is to select a succinct subset of sentences from a long document. There, acquiring accurate annotations from human experts is costly and difficult. Thirdly, for many tasks, we need to evaluate the performance of the learned DPP not only by its accuracy in predicting whether an item should be selected, but also by other measures like precision and recall. For instance, failing to select key sentences for summarizing documents might be regarded as being more catastrophic than injecting sentences with repetitive information into the summary.

Existing methods of parameter estimation for DPPs are inadequate to address these challenges. For example, maximum likelihood estimation (MLE) typically requires a large number of training samples in order to estimate the underlying model correctly. This also limits the number of the parameters it can estimate reliably, restricting its use to DPPs whose kernels can be parameterized with few degrees of freedom. It also does not offer fine control over precision and recall.

We propose a two-pronged approach for learning a DPP from labeled data. First, we improve modeling flexibility by reparameterizing the DPP’s kernel matrix with multiple base kernels. This representation could easily incorporate domain knowledge and requires learning fewer parameters (instead of the whole kernel matrix). Then, we optimize the parameters such that the probability of the correct subset is larger than other erroneous subsets by a large margin. This margin is task-specific and can be customized to reflect the desired performance measure—for example, to monitor precision and recall. As such, our approach defines objective functions that closely track selection errors and work well with few training samples. While the principle of large margin separation has been widely used in classification [9] and structured prediction [10], formulating DPP learning with the large margin principle is novel. Our empirical studies show that the proposed method attains superior performance on two challenging tasks of practical interest: document and video summarization.

The rest of the paper is organized as follows. We provide background on the DPP in section 2, followed by our approach in section 3. We discuss related work in section 4 and report our empirical studies in section 5. We conclude in section 6.

2 Background: Determinantal point processes

We first review background on the determinantal point process (DPP) [11] and the standard maximum likelihood estimation technique for learning DPP parameters from data. More details can be found in the excellent tutorial [1].

Given a ground set of $M$ items, $\mathcal{Y} = \{1, 2, \ldots, M\}$, a DPP defines a probabilistic measure over the power set, i.e., all possible subsets (including the empty set) of $\mathcal{Y}$. Concretely, let $L$ denote a symmetric and positive semidefinite matrix in $\mathbb{R}^{M \times M}$. The probability of selecting a subset $y \subseteq \mathcal{Y}$ is given by

$$P(y; L) = \det(L + I)^{-1} \det(L_y), \tag{1}$$

where $L_y$ denotes the submatrix of $L$, with rows and columns selected by the indices in $y$, and $I$ is the identity matrix with the proper size. We define $\det(L_\emptyset) = 1$. The above way of defining a DPP is called an L-ensemble. An equivalent way of defining a DPP is to use a kernel matrix $K$ to define the marginal probability of selecting a random subset:

$$P_y = \sum_{y' \subseteq \mathcal{Y}} P(y'; L)\, \mathbb{I}[y \subseteq y'] = \det(K_y), \tag{2}$$

where we sum over all subsets $y'$ that contain $y$ ($\mathbb{I}[\cdot]$ is an indicator function). The matrix $K$ is another positive semidefinite matrix, computable from the matrix $L$,

$$K = L (L + I)^{-1}, \tag{3}$$

and $K_y$ is the submatrix of $K$ indexed by $y$. Despite the exponential number of summands in eq. (2), the marginalization is analytically tractable and computable in polynomial time.
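The two definitions can be checked against each other numerically on a tiny ground set. The following is a minimal sketch, not the authors' code: the helper names (`subset_prob`, `marginal_kernel`, `all_subsets`) and the random 4-item matrix are assumptions made for illustration. It enumerates all subsets to confirm that the L-ensemble probabilities sum to one and that a brute-force marginal matches the determinant of the corresponding submatrix of K = L(L + I)^{-1}.

```python
import itertools
import numpy as np

def subset_prob(L, y):
    """P(y; L) = det(L_y) / det(L + I), with det(L_emptyset) = 1."""
    y = list(y)
    num = np.linalg.det(L[np.ix_(y, y)]) if y else 1.0
    return num / np.linalg.det(L + np.eye(L.shape[0]))

def marginal_kernel(L):
    """K = L (L + I)^{-1}."""
    return L @ np.linalg.inv(L + np.eye(L.shape[0]))

def all_subsets(M):
    """Yield every subset of {0, ..., M-1} as a tuple, including the empty one."""
    for r in range(M + 1):
        yield from itertools.combinations(range(M), r)

# Illustrative 4-item ground set with a random symmetric PSD matrix.
rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))
L = B @ B.T
K = marginal_kernel(L)

# The L-ensemble probabilities of all 2^4 subsets should sum to one.
total = sum(subset_prob(L, y) for y in all_subsets(4))

# Marginal of y = {0, 2}: sum over all supersets versus det(K_y).
y = (0, 2)
brute = sum(subset_prob(L, s) for s in all_subsets(4) if set(y) <= set(s))
closed = np.linalg.det(K[np.ix_(y, y)])
print(total, brute, closed)
```

The enumeration is exponential in the number of items and only serves as a sanity check; the determinant of the submatrix of K is the polynomial-time route.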

Modeling diversity

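One particularly useful property of the DPP is its ability to model pairwise repulsion. Consider the marginal probability of having two items $i$ and $j$ simultaneously in a subset:

$$P_{\{i,j\}} = \det \begin{pmatrix} K_{ii} & K_{ij} \\ K_{ji} & K_{jj} \end{pmatrix} = K_{ii} K_{jj} - K_{ij}^2 \le K_{ii} K_{jj} = P_{\{i\}} P_{\{j\}} \le \min(P_{\{i\}}, P_{\{j\}}). \tag{4}$$

Thus, unless $K_{ij} = 0$, the probability of observing $i$ and $j$ jointly is always less than that of observing either $i$ or $j$ separately. Namely, having $i$ in a subset repulsively excludes $j$ and vice versa. Another extreme case is when $i$ and $j$ are the same; then $K_{ii} = K_{jj} = K_{ij}$, which leads to $P_{\{i,j\}} = 0$. Namely, we should never allow them together in any subset.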
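As a small worked illustration of this repulsion, consider a 2 × 2 marginal kernel whose entries are made up for this example (they are not taken from the paper). Its eigenvalues are 0.9 and 0.1, so it is a valid marginal kernel.

```latex
K = \begin{pmatrix} 0.5 & 0.4 \\ 0.4 & 0.5 \end{pmatrix}, \qquad
P_{\{i\}} = P_{\{j\}} = 0.5, \qquad
P_{\{i,j\}} = 0.5 \cdot 0.5 - 0.4^2 = 0.09 \;<\; 0.25 = P_{\{i\}} P_{\{j\}}.
```

With the off-diagonal entry set to 0 instead, the joint probability would equal the product 0.25, i.e., the two items would be selected independently.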

Consequently, a subset with a large (marginal) probability cannot have too many items that are similar to each other (i.e., with high values of $K_{ij}$). In other words, the probability provides a gauge of the diversity of the subset. The most diverse subset, which balances all the pairwise repulsions, is the subset that attains the highest probability:

$$y^* = \arg\max_{y \subseteq \mathcal{Y}} P(y; L) = \arg\max_{y \subseteq \mathcal{Y}} \det(L_y). \tag{5}$$

Note that this MAP inference is computed with respect to the L-ensemble (instead of $K$) as we are interested in the mode, not the marginal probability of having the subset. Unfortunately, the MAP inference is NP-hard [12]. Various approximation algorithms have been investigated [13, 1].
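For very small ground sets the mode can still be found by enumeration, which is also how the synthetic experiments later obtain their targets. The sketch below is an illustrative brute-force search, not one of the approximation algorithms cited above; the function name `map_exhaustive` and the 3-item matrix (items 0 and 1 are near-duplicates, item 2 is unrelated) are assumptions for the example.

```python
import itertools
import numpy as np

def map_exhaustive(L):
    """Return the subset maximizing det(L_y) by enumerating all 2^M subsets."""
    M = L.shape[0]
    best, best_det = (), 1.0          # det(L_emptyset) = 1 by convention
    for r in range(1, M + 1):
        for y in itertools.combinations(range(M), r):
            d = np.linalg.det(L[np.ix_(y, y)])
            if d > best_det:
                best, best_det = y, d
    return best, best_det

# Items 0 and 1 are highly similar (large off-diagonal entry); item 2 is unrelated.
L = np.array([[4.0, 3.9, 0.0],
              [3.9, 4.0, 0.0],
              [0.0, 0.0, 4.0]])
best, best_det = map_exhaustive(L)
print(best, best_det)  # one of the near-duplicates is paired with item 2
```

The cost grows as 2^M, so this is only practical for ground sets of roughly a dozen items or fewer.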

Maximum likelihood estimation (MLE)

Suppose we are given a training set $\{(\mathcal{Y}_n, y_n^*)\}_{n=1}^{N}$, where each ground set $\mathcal{Y}_n$ is annotated with its most diverse subset $y_n^*$. How can we discover the underlying parameters $L$ or $K$? Note that different ground sets need not overlap. Thus, directly specifying kernel values for every pair of items is unlikely to be scalable. Instead, we will need to assume that either $L$ or $K$ for each ground set is represented by a shared set of parameters $\theta$.

For items $i$ and $j$ in $\mathcal{Y}_n$, suppose their kernel values $L_{n,ij}$ can be computed as a function of $x_i$, $x_j$, and $\theta$, where $x_i$ and $x_j$ are features characterizing those items. Our learning objective is to optimize $\theta$ such that $y_n^*$ is the most diverse subset in $\mathcal{Y}_n$, or attains the highest probability. This gives rise to the following maximum likelihood estimate (MLE) [3],

$$\theta^{\mathrm{MLE}} = \arg\max_{\theta} \sum_{n=1}^{N} \log P(y_n^*; L_n(\theta)), \tag{6}$$

where $L_n(\cdot)$ converts features in $\mathcal{Y}_n$ to the $L$ matrix for the ground set $\mathcal{Y}_n$. MLE has been a standard approach for estimating DPP parameters. However, as we will discuss in section 3.2, it has important limitations.
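The MLE objective is a sum of log-determinant differences, which the following sketch evaluates. It is a toy under stated assumptions rather than the estimator used in the paper: `toy_build_L` is a hypothetical one-parameter model (a scaled linear kernel), the data are random, and the grid search merely stands in for a proper gradient-based optimizer.

```python
import numpy as np

def log_prob(L, y):
    """log P(y; L) = log det(L_y) - log det(L + I); y must be non-empty."""
    idx = list(y)
    _, logdet_y = np.linalg.slogdet(L[np.ix_(idx, idx)])
    _, logdet_norm = np.linalg.slogdet(L + np.eye(L.shape[0]))
    return logdet_y - logdet_norm

def toy_build_L(X, theta):
    """Hypothetical one-parameter model: L = exp(theta) * X X^T."""
    return np.exp(theta) * (X @ X.T)

def log_likelihood(theta, data, build_L=toy_build_L):
    """MLE objective: sum over training pairs of log P(y_n^*; L_n(theta))."""
    return sum(log_prob(build_L(X, theta), y) for X, y in data)

rng = np.random.default_rng(0)
# Two small ground sets (4 items, 5-dim features) with made-up target subsets.
data = [(rng.normal(size=(4, 5)), (0, 2)),
        (rng.normal(size=(4, 5)), (1, 3))]

thetas = np.linspace(-3.0, 3.0, 61)
values = [log_likelihood(t, data) for t in thetas]
theta_hat = thetas[int(np.argmax(values))]   # crude grid search for illustration
print(theta_hat, max(values))
```

Using `slogdet` instead of `det` keeps the computation stable when the determinants are very small or very large.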

Next, we introduce our method for learning the parameters. We first present our multiple kernel based representation of the $L$ matrix and then the large-margin based estimation.

3 Our Approach

Our approach consists of two components that are developed in parallel, yet work in concert: (1) the use of multiple kernel functions to represent the DPP; (2) applying the principle of large margin separation to optimize the parameters. The former reduces the number of parameters to learn and thus is especially advantageous when the number of training samples is limited. The latter strengthens the advantage by optimizing objective functions that closely track subset selection errors.

3.1 Multiple kernel representation of a DPP

Learning the $L$ or $K$ matrix for a DPP is an instance of learning kernel functions, as those matrices are positive semidefinite matrices, interpretable as kernel functions being evaluated on the items in the ground set. Thus, our goal is essentially to learn the right kernel function to measure similarity.

However, for many applications, similarity is just one of the criteria for selecting items. For instance, in the previous example of image retrieval, the retrieved images not only need to be diverse (thus different) but also need to have strong relevance to the query term. Similarly, in document summarization, the selected sentences not only need to be succinct and not redundant, but also need to represent the contents of the document.

Kulesza and Taskar [3] propose to balance these two potentially conflicting forces with a decomposable $L$ matrix:

$$L_{ij} = q_i\, S_{ij}\, q_j, \qquad q_i = q(x_i) = \exp\left(\tfrac{1}{2} \theta^\top x_i\right), \qquad S_{ij} = \phi_i^\top \phi_j, \tag{7}$$

where $q_i$ is referred to as the quality factor, modeling how representative or relevant the selected items are. It depends on item $i$'s feature vector $x_i$, which encodes $i$'s contextual information and its representativeness of other items. For example, in document summarization, possible features are the sentence lengths, positions of the sentences in the text, or others. $S_{ij}$, on the other hand, measures how similar two sentences are, computed from a different set of features, $\phi_i$ and $\phi_j$, such as bag-of-words descriptors that represent each item's individual characteristics.
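The quality/similarity decomposition can be assembled in a few lines. The sketch below assumes the log-linear quality model q_i = exp(0.5 θᵀx_i) of [3] and a linear similarity S = ΦΦᵀ; the function names, feature dimensions, and random inputs are illustrative assumptions, not values from the paper.

```python
import numpy as np

def quality(X, theta):
    """q_i = exp(0.5 * theta^T x_i), one value per item (rows of X)."""
    return np.exp(0.5 * X @ theta)

def build_L(X, Phi, theta):
    """L_ij = q_i * S_ij * q_j with the linear similarity S = Phi Phi^T."""
    q = quality(X, theta)
    S = Phi @ Phi.T
    return q[:, None] * S * q[None, :]

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))      # quality features (illustrative)
Phi = rng.normal(size=(6, 4))    # similarity features (illustrative)
theta = rng.normal(size=3)

L = build_L(X, Phi, theta)
eigvals = np.linalg.eigvalsh(L)
print(L.shape, eigvals.min())    # L stays symmetric positive semidefinite
```

Because L is a congruence transform diag(q) S diag(q) of a positive semidefinite S, it remains positive semidefinite for any θ, which is what makes this parameterization convenient to learn.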

However, prior work [3] does not investigate whether this specific definition of similarity could be made optimal and adapted to the data, thus largely limiting the modeling power of the DPP to inferring the quality $q_i$. Our empirical studies show that this limitation can be severe, especially when the modeling choice is erroneous (cf. section 5.1).

In this paper, we retain the aspect of quality modeling but improve the modeling of similarity in two ways. First, we use nonlinear kernel functions such as the Gaussian RBF kernel to determine similarity. Secondly, and more importantly, we combine several base kernels:

$$S_{ij} = \sum_{k} \alpha_k \exp\left(-\frac{\|\phi_i - \phi_j\|_2^2}{\sigma_k^2}\right) + \beta\, \phi_i^\top \phi_j, \tag{8}$$

where $k$ indexes the base kernels and $\sigma_k$ is a scaling factor. The combination coefficients are constrained such that $\sum_k \alpha_k + \beta = 1$. They are optimized on the annotated data, either via maximum likelihood estimation or via our novel parameter estimation technique, to be described next.
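A combination of base kernels of this kind can be sketched as follows, assuming Gaussian RBF base kernels with bandwidths σ_k plus a linear kernel weighted by β, and assuming non-negative weights that sum to one so the result stays positive semidefinite. The bandwidths and weights below are hypothetical placeholders, and the function names are not from the paper.

```python
import numpy as np

def rbf_kernel(Phi, sigma):
    """Gaussian RBF kernel matrix exp(-||phi_i - phi_j||^2 / sigma^2)."""
    sq = np.sum(Phi ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Phi @ Phi.T
    return np.exp(-np.maximum(d2, 0.0) / sigma ** 2)

def multi_kernel_similarity(Phi, alphas, sigmas, beta):
    """S = sum_k alpha_k * RBF(sigma_k) + beta * Phi Phi^T, weights summing to 1."""
    alphas = np.asarray(alphas, dtype=float)
    if np.any(alphas < 0) or beta < 0:
        raise ValueError("weights are assumed non-negative in this sketch")
    if not np.isclose(alphas.sum() + beta, 1.0):
        raise ValueError("weights must sum to one")
    S = beta * (Phi @ Phi.T)
    for a, s in zip(alphas, sigmas):
        S = S + a * rbf_kernel(Phi, s)
    return S

rng = np.random.default_rng(0)
Phi = rng.normal(size=(5, 3))
# Hypothetical bandwidths and weights, chosen only for illustration.
S = multi_kernel_similarity(Phi, alphas=[0.5, 0.3], sigmas=[0.5, 2.0], beta=0.2)
S_rbf_only = multi_kernel_similarity(Phi, alphas=[0.6, 0.4], sigmas=[0.5, 2.0], beta=0.0)
print(np.linalg.eigvalsh(S).min(), np.diag(S_rbf_only))
```

In a learning setting the weights would be free parameters optimized on the annotated data; here they are fixed to keep the example short.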

3.2 Large-margin estimation of DPP

Maximum likelihood estimation does not closely track discriminative errors [15, 9, 16]. While improving the likelihood of the ground-truth subset $y_n^*$, MLE could also improve the likelihoods of other competing subsets. Consequently, a model learned with MLE could have modes that are very different subsets yet are very close to each other in their probability values. Having highly confusable modes is especially problematic for the DPP's NP-hard MAP inference: the difference between such modes can fall within the approximation errors of approximate inference algorithms, such that the true MAP cannot be easily extracted.

Multiplicative large margin constraints

To address these deficiencies, our large-margin based approach aims to maintain or increase the margin between the correct subset and alternative, incorrect ones. Specifically, we formulate the following large margin constraints:

$$\log P(y_n^*; L_n) \ge \max_{y \subseteq \mathcal{Y}_n} \log \left[ \ell(y_n^*, y)\, P(y; L_n) \right] = \max_{y \subseteq \mathcal{Y}_n} \left[ \log \ell(y_n^*, y) + \log P(y; L_n) \right], \tag{9}$$

where $\ell(y_n^*, y)$ is a loss function measuring the discrepancy between the correct subset $y_n^*$ and an alternative $y$. We assume $\ell(y_n^*, y_n^*) = 0$.

Intuitively, the more different $y$ is from $y_n^*$, the larger the gap we want to maintain between the two probabilities. This way, the incorrect one has less chance to be identified as the most diverse one. Note that while similar intuitions have been explored in multiway classification and structured prediction, the margin here is multiplicative instead of additive. This is by design, as it leads to a tractable optimization over the exponential number of constraints, as we will explain later.

Design of the loss function

A natural choice for the loss function is the Hamming distance between $y_n^*$ and $y$, counting the number of disagreements between two subsets:

$$\ell_H(y_n^*, y) = \sum_{i \in y} \mathbb{I}[i \notin y_n^*] + \sum_{i \notin y} \mathbb{I}[i \in y_n^*]. \tag{10}$$

In this loss function, failing to select the right item costs the same as adding an unnecessary item. In many tasks, however, this symmetry does not hold. For example, in summarizing a document, omitting a key sentence has more severe consequences than adding a (trivial) sentence.

To balance these two types of errors, we introduce the generalized Hamming loss function,

$$\ell_\omega(y_n^*, y) = \sum_{i \in y} \mathbb{I}[i \notin y_n^*] + \omega \sum_{i \notin y} \mathbb{I}[i \in y_n^*]. \tag{11}$$

When $\omega$ is greater than 1, the learning biases towards higher recall to select as many items in $y_n^*$ as possible. When $\omega$ is significantly less than 1, the learning biases towards high precision to avoid incorrect items as much as possible. Our empirical studies demonstrate such flexibility and its advantages in two real-world summarization tasks.
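The asymmetric loss is simple to state in code. This minimal sketch assumes the weight (called `omega` here) multiplies the count of target items that were missed, which matches the description that values above 1 favor recall; the function name is an illustrative choice.

```python
def generalized_hamming(y_true, y_pred, omega=1.0):
    """Count wrongly added items plus omega times the missed target items."""
    y_true, y_pred = set(y_true), set(y_pred)
    false_pos = len(y_pred - y_true)   # selected but not in the target
    false_neg = len(y_true - y_pred)   # in the target but not selected
    return false_pos + omega * false_neg

target = {0, 1, 2}
guess = {1, 2, 3}
print(generalized_hamming(target, guess))             # symmetric Hamming distance
print(generalized_hamming(target, guess, omega=4.0))  # missing item 0 costs 4x
```

With `omega=1.0` the function reduces to the ordinary Hamming distance between the two subsets.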

Numerical optimization

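To overcome the challenge of dealing with an exponential number of constraints in eq. (9), we reformulate it as a tractable optimization problem. We first upper-bound the hard-max operation with Jensen's inequality (i.e., softmax):

$$\max_{y \subseteq \mathcal{Y}_n} \left[ \log \ell(y_n^*, y) + \log P(y; L_n) \right] \le \log \sum_{y \subseteq \mathcal{Y}_n} \ell(y_n^*, y)\, P(y; L_n). \tag{12}$$

With the loss function $\ell_\omega$, the right-hand side is computable in polynomial time,

$$\log \sum_{y \subseteq \mathcal{Y}_n} \ell_\omega(y_n^*, y)\, P(y; L_n) = \log \left( \sum_{i \notin y_n^*} K_{n,ii} + \omega \sum_{i \in y_n^*} (1 - K_{n,ii}) \right), \tag{13}$$

where $K_{n,ii}$ is the $i$-th element on the diagonal of $K_n$, the marginal kernel matrix corresponding to $L_n$. The detailed derivation of this result is in the supplementary material. Note that $K_n$ can be computed efficiently from $L_n$ through the identity eq. (3).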
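The polynomial-time expression can be verified against the exponential sum on a small ground set. The sketch below assumes the generalized Hamming loss with weight ω on missed target items and the marginal kernel K = L(L + I)^{-1}; the function names and the random 5-item matrix are illustrative assumptions.

```python
import itertools
import numpy as np

def expected_loss_closed_form(L, y_true, omega):
    """sum_{i not in y*} K_ii + omega * sum_{i in y*} (1 - K_ii)."""
    M = L.shape[0]
    k = np.diag(L @ np.linalg.inv(L + np.eye(M)))
    in_target = np.zeros(M, dtype=bool)
    in_target[list(y_true)] = True
    return k[~in_target].sum() + omega * (1.0 - k[in_target]).sum()

def expected_loss_brute_force(L, y_true, omega):
    """sum over all subsets y of loss(y*, y) * P(y; L); exponential in M."""
    M = L.shape[0]
    Z = np.linalg.det(L + np.eye(M))
    y_true = set(y_true)
    total = 0.0
    for r in range(M + 1):
        for y in itertools.combinations(range(M), r):
            p = (np.linalg.det(L[np.ix_(y, y)]) if y else 1.0) / Z
            loss = len(set(y) - y_true) + omega * len(y_true - set(y))
            total += loss * p
    return total

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
L = B @ B.T                      # illustrative symmetric PSD matrix
closed = expected_loss_closed_form(L, (0, 3), omega=2.0)
brute = expected_loss_brute_force(L, (0, 3), omega=2.0)
print(closed, brute)             # the two values should agree up to round-off
```

Taking the logarithm of either value gives the softmax term; the closed form needs one matrix inverse instead of 2^M determinants.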

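The softmax can be seen as a summary of all undesirable subsets (the correct subset $y_n^*$ does not contribute to the weighted sum as $\ell_\omega(y_n^*, y_n^*) = 0$). Our optimization balances this term with the likelihood of the target with the hinge loss function $[z]_+ = \max(0, z)$:

$$\min_{\theta} \sum_{n} \left\{ -\log P(y_n^*; L_n) + \lambda \left[ -\log P(y_n^*; L_n) + \log \left( \sum_{i \notin y_n^*} K_{n,ii} + \omega \sum_{i \in y_n^*} (1 - K_{n,ii}) \right) \right]_+ \right\}, \tag{14}$$

where $\lambda \ge 0$ is a tradeoff coefficient, to be tuned on validation datasets. Note that this objective function subsumes maximum likelihood estimation where $\lambda = 0$. We optimize the objective function with subgradient descent. Details are in the supplementary material.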
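A per-example version of this objective can be sketched as below. It assumes the form "negative log-likelihood plus λ times a hinge on (softmax term minus log-likelihood)", with the softmax term computed from the diagonal of K = L(L + I)^{-1}; the function name, default values, and test matrix are illustrative, and the subgradient step itself is not shown.

```python
import numpy as np

def large_margin_loss(L, y_true, omega=1.0, lam=1.0):
    """Per-example objective: -log P(y*) + lam * [softmax - log P(y*)]_+ .

    y_true must be a non-empty collection of item indices.
    """
    M = L.shape[0]
    idx = list(y_true)
    _, logdet_norm = np.linalg.slogdet(L + np.eye(M))
    _, logdet_y = np.linalg.slogdet(L[np.ix_(idx, idx)])
    log_p = logdet_y - logdet_norm

    k = np.diag(L @ np.linalg.inv(L + np.eye(M)))   # marginals K_ii
    mask = np.zeros(M, dtype=bool)
    mask[idx] = True
    softmax = np.log(k[~mask].sum() + omega * (1.0 - k[mask]).sum())

    hinge = max(0.0, softmax - log_p)
    return -log_p + lam * hinge

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
L = B @ B.T                                          # illustrative PSD matrix
loss_mle = large_margin_loss(L, (0, 3), lam=0.0)     # reduces to -log P(y*)
loss_margin = large_margin_loss(L, (0, 3), omega=2.0, lam=1.0)
print(loss_mle, loss_margin)
```

Setting `lam=0.0` recovers the negative log-likelihood, which mirrors the remark that the objective subsumes maximum likelihood estimation.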

4 Related work

The DPP arises from random matrix theory and quantum physics [11, 1]. In machine learning, researchers have proposed different variations to improve its modeling capacity. Kulesza and Taskar introduced the k-DPP to restrict the sets to have a constant size [4]. Affandi et al. proposed a Markov DPP which offers diversity at adjacent time stamps [7]. A structured DPP was presented in [6] to model trees and graphs. The MAP inference of DPP is generally NP-hard [12]. Gillenwater et al. developed a 1/4-approximation algorithm [13]. In practice, greedy inference gives rise to decent results [3] though it lacks theoretical guarantees. Another popular alternative is to resort to fast sampling algorithms [5, 1].

In spite of much research activity surrounding DPPs, there is very little work exploring how to effectively learn the model parameters. MLE is the most popular estimator. Compared to MLE, our approach is more robust to a limited number of training samples and to mis-specified models, and offers greater flexibility by incorporating customizable error functions. A recent Bayesian approach works with the posterior over the parameters [17]. In contrast to that work, we develop a large-margin training approach for DPPs and directly minimize the set selection errors. The large-margin principle has been widely used in classification [9] and structured prediction [10, 18, 19, 20], but its application to DPP is original. In order to make it tractable for DPPs, we use multiplicative rather than additive margin constraints.

5 Experiments

We validate our large-margin approach to learn DPP parameters with extensive empirical studies on both synthetic data and two real-world summarization tasks with documents and videos. While DPP also has applications beyond summarization, this is a particularly good testbed to illustrate diverse subset selection: a compact summary ought to include high quality items that, taken together, offer good coverage of the source content. We report key results in this section, and provide more extensive results in the supplementary material.

5.1 Synthetic dataset


Our ground set has 10 items, $M = 10$. For each item, we sample a 5-dimensional feature vector $x_i$ from a spherical Gaussian. To generate the $L$ matrix for the DPP, we follow the model in eq. (7); we sample the parameter vector $\theta$ from a spherical Gaussian, and for the similarity we simply let $\phi_i = x_i$ and compute $S_{ij} = \phi_i^\top \phi_j$.

We identify the most diverse subset $y^*$ (eq. (5)) via exhaustive search of all subsets, which is possible given the small ground set. The resulting $y^*$ has 5 items on average. We then add noise by randomly (with probability 0.1) adding or dropping an item to or from $y^*$. We repeat the process of sampling another pair of the ground set and its most diverse set. We do so 200 times and use 100 pairs for holdout and 100 for testing. We repeat the process to yield training sets of various sizes.
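The generation procedure can be sketched as follows. Several details are assumptions made for this illustration: the Gaussians are taken to be standard normal, the quality model is taken to be q_i = exp(0.5 θᵀx_i) as in [3], and the noise step is read as flipping each item's membership independently with probability 0.1. The function name `sample_instance` is hypothetical.

```python
import itertools
import numpy as np

def sample_instance(rng, theta, M=10, d=5, flip_prob=0.1):
    """Sample one ground set, its exhaustive-search target, and a noisy target."""
    X = rng.normal(size=(M, d))                 # item features (assumed N(0, I))
    q = np.exp(0.5 * X @ theta)                 # quality factors (assumed form)
    L = q[:, None] * (X @ X.T) * q[None, :]     # linear similarity on the same features

    best, best_det = (), 1.0                    # det(L_emptyset) = 1
    for r in range(1, M + 1):
        for y in itertools.combinations(range(M), r):
            det = np.linalg.det(L[np.ix_(y, y)])
            if det > best_det:
                best, best_det = y, det

    # Noise: flip each item's membership independently (one reading of the text).
    flips = rng.random(M) < flip_prob
    noisy = {i for i in range(M) if (i in best) != bool(flips[i])}
    return X, set(best), noisy

rng = np.random.default_rng(0)
theta = rng.normal(size=5)                      # shared parameter vector
dataset = [sample_instance(rng, theta) for _ in range(3)]
for X, y_star, y_noisy in dataset:
    print(sorted(y_star), sorted(y_noisy))
```

Since the features are 5-dimensional, L has rank at most 5, so subsets with more than 5 items have (numerically) zero determinant and are never selected as the target.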

Evaluation metrics

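We evaluate the quality of the selected subset $y$ against the ground-truth $y^*$ using the F-score, which is the harmonic mean of precision and recall:

$$\text{Precision} = \frac{|y \cap y^*|}{|y|}, \qquad \text{Recall} = \frac{|y \cap y^*|}{|y^*|}, \qquad \text{F-score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}.$$

All three quantities are between 0 and 1, and higher values are better.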
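These set-based metrics take only a few lines to compute. The sketch below is an illustrative helper (the name `precision_recall_f` and the convention of returning 0 for empty sets or no overlap are assumptions of this example, not specified in the text).

```python
def precision_recall_f(y_pred, y_true):
    """Set-based precision, recall, and F-score of a selected subset."""
    y_pred, y_true = set(y_pred), set(y_true)
    hits = len(y_pred & y_true)
    precision = hits / len(y_pred) if y_pred else 0.0
    recall = hits / len(y_true) if y_true else 0.0
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_score

# Four items selected, two of which form the whole ground-truth subset.
p, r, f = precision_recall_f({0, 1, 2, 3}, {1, 2})
print(p, r, f)   # precision 0.5, recall 1.0, F-score 2/3
```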

Learning and inference

We compare our large-margin approach using the Hamming loss (eq. (10)) to the standard MLE method for learning DPP parameters. (Footnote 1: Adding a zero-mean Gaussian prior over $\theta$ while learning with MLE, as in [3], did not yield improvement.) All hyperparameters are tuned by cross-validation. After learning, we apply MAP inference to the testing ground sets.

Figure 1: On synthetic datasets, our method significantly outperforms the state-of-the-art parameter estimation technique (MLE) [3] in various learning settings. Panels: (a) learning $\theta$ only, with correctly specified $S$; (b) learning $\theta$ under mis-specified $S$ (# training instances = 200); (c) learning both $\theta$ and $S$, with multiple kernel parameterization. See text for details. Best viewed in color.


The DPP is parameterized by two things: $\theta$ for the quality of the items, and $S_{ij}$ for the similarity among them. Since the ground-truth parameters are known to us, we conduct experiments to isolate the impact of learning either one.

Fig. 1(a) contrasts the two methods when learning $\theta$ only, assuming all $S_{ij}$ are known and the ground-truth values are used. Our method significantly outperforms MLE. When the number of training samples is increased, the performance of our method generally improves and gets very close to the oracle's performance, for which the true values of both $\theta$ and $S$ are used.

Fig. 1(b) examines the two methods in the setting of model mis-specification, where the $S_{ij}$ values deliberately deviate from the true values. Specifically, we set them to Gaussian RBF kernel values whose bandwidth $\sigma$ varies from small to large, while the true values are the linear kernel values $S_{ij} = \phi_i^\top \phi_j$. All methods generally suffer. However, our method is fairly robust to the mis-specification while MLE quickly deteriorates. Our advantage is likely due to our method's focus on learning to reduce subset selection errors, whereas MLE focuses on learning the right probabilistic model (even if it is already mis-specified).

Fig. 1(c) compares the two methods when both $\theta$ and $S$ need to be learned from the data. We apply our multiple kernel parameterization technique to model $S$, as in eq. (8), except $\beta$ is set to be zero to avoid including the ground-truth. We see that our parameterization overcomes the problems of model mis-specification in Fig. 1(b), demonstrating its effectiveness in approximating unknown similarities. In fact, both learning methods match the performance of the corresponding methods with ground-truth similarity values, respectively. Nonetheless, our large-margin estimation still outperforms MLE significantly.

In summary, our results on synthetic data are very encouraging. Our multiple kernel parameterization avoids the pitfall of model mis-specification, and the large margin estimation outperforms MLE due to its ability to track selection errors more closely.

5.2 Document summarization

Next we apply DPP to the task of extractive multi-document summarization [21, 3, 14]. In this task, the input is a document cluster consisting of several documents on a single topic. The desired output is a subset of the sentences in the cluster that serve as a summary for the entire cluster. Naturally, we want the sentences in this subset to be both representative and diverse.


We use the text data from Document Understanding Conference (DUC) 2003 and 2004 [21] as the training and testing sets, respectively. There are 60 document clusters in DUC 2003 and 50 in DUC 2004, each collected over a short time period on a single topic. A cluster includes 10 news articles and on average 250 sentences. Four human reference summaries are provided along with each cluster. Following prior work, we generate the oracle/ground-truth summary by identifying a subset of the original sentences that best agree with the human reference summaries [3]. On average, the oracle summary consists of 5 sentences. As is standard practice, we use the oracles only during training. During testing, the algorithm output is evaluated against each of the four human reference summaries separately, and we report the average accuracy [21, 3, 14].

We use the widely used evaluation package ROUGE [22], which scores document summaries based on $n$-gram overlap statistics. We use ROUGE 1.5.5 along with WordNet 2.0, and report the F-score (F), Precision (P), and Recall (R) of both unigram and bigram matchings, denoted by ROUGE-1X and ROUGE-2X respectively (X ∈ {F, P, R}). Additionally, we limit the maximum length of each summary to be 665 characters to be consistent with existing work [21]. This yields 5 sentences on average for subsets generated by our algorithm.

| Method | ROUGE-1F | ROUGE-1P | ROUGE-1R | ROUGE-2F | ROUGE-2P | ROUGE-2R |
|---|---|---|---|---|---|---|
| PEER 35 [21] | 37.54 | 37.69 | 37.45 | 8.37 | - | - |
| PEER 104 [21] | 37.12 | 36.79 | 37.48 | 8.49 | - | - |
| PEER 65 [21] | 37.87 | 37.58 | 38.20 | 9.13 | - | - |
| MLE dpp+cos [3] | 37.89 ± 0.08 | 37.37 ± 0.08 | 38.46 ± 0.08 | 7.72 ± 0.06 | 7.63 ± 0.06 | 7.83 ± 0.06 |
| Ours (large-margin dpp+cos) | 38.36 ± 0.09 | 37.72 ± 0.10 | 39.07 ± 0.08 | 8.20 ± 0.07 | 8.07 ± 0.07 | 8.35 ± 0.07 |
| Ours (MLE dpp+mkr) | 39.14 ± 0.08 | 39.03 ± 0.09 | 39.31 ± 0.09 | 9.25 ± 0.08 | 9.24 ± 0.08 | 9.27 ± 0.08 |
| Ours (large-margin dpp+mkr) | 39.71 ± 0.05 | 39.61 ± 0.08 | 39.87 ± 0.06 | 9.40 ± 0.08 | 9.38 ± 0.08 | 9.43 ± 0.08 |

Table 1: Accuracy on document summarization. Our methods outperform others with statistical significance.

To allow the fairest comparison to existing DPP work for this task, we use the same features designated in [3]. To model quality, the features are the sentence length, position in the original document, mean cluster similarity, LexRank [23], and personal pronouns. To model the similarity, the features are the standard normalized term frequency-inverse document frequency (tf-idf) vectors.


We consider two ways of modeling similarities. The first one is to use the cosine similarity (cos) between feature vectors, as in [3]. The second is our multiple kernel based similarity (mkr, eq. (8)). For mkr, the bandwidths $\sigma_k$ are preset, and the combination coefficients are learned on the data. We implement the method in [3] as a baseline (MLE dpp+cos). We also test an enhanced variant of that method by replacing its cosine similarity with our multiple kernel based similarity (MLE dpp+mkr).


Table 1 compares several DPP-based methods, as well as the top three results (PEER 35, 104, 65) from the DUC 2004 competition, which are not DPP-based (“-” indicates results not available). Since the DPP MAP inference is NP-hard, we use a sampling technique to extract the most diverse subset [1]. We run inference 10 times and report the mean accuracy and standard error.

The state-of-the-art MLE-trained DPP model (MLE dpp+cos [3]) achieves about the same performance as the best PEER results of DUC 2004. We obtain a noticeable improvement by applying our large-margin estimation (large-margin dpp+cos). By applying multiple kernels to model similarity, we obtain significant improvements (above the standard errors) for both parameter estimation techniques. In particular, our complete method, large-margin dpp+mkr, attains the best performance across all the evaluation metrics.

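| Metric | VSUMM1 [24] | VSUMM2 [24] | MLE dpp+mkr | Ours (dpp+mkr), setting i | Ours (dpp+mkr), setting ii | Ours (dpp+mkr), setting iii |
|---|---|---|---|---|---|---|
| F-score | 70.25 | 68.20 | 72.94 ± 0.08 | 71.25 ± 0.09 | 73.46 ± 0.07 | 72.39 ± 0.10 |
| Precision | 70.57 | 73.14 | 68.40 ± 0.08 | 74.00 ± 0.09 | 69.68 ± 0.08 | 67.19 ± 0.11 |
| Recall | 75.77 | 69.14 | 82.51 ± 0.11 | 72.71 ± 0.11 | 81.39 ± 0.09 | 83.24 ± 0.09 |

Table 2: Accuracy on video summarization. Our method performs the best and allows precision-recall control. The three columns for our method correspond to different values of the tradeoff constant $\omega$.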

5.3 Video summarization

Finally, we demonstrate the broad applicability of our method by applying it to video summarization. In this case, the goal is to select a set of representative and diverse frames from a video sequence.


The dataset consists of 50 videos from the Open Video Project (OVP). They are 30 fps, 352 × 240 pixels, vary from 1 to 4 minutes, and are distributed across several genres including documentary, educational, historical, etc. We use the provided ground truth key frame summaries [24], where each video is labeled by five annotators independently. We perform 5-fold validation and report the average result. We apply several preprocessing steps to remove frames that are trivially redundant (due to high temporal correlation) or of low visual quality. We use a similar procedure as in the document summarization task to generate the ground-truth subsets. On average, the ground-truth has 9 frames (in contrast, our method yields subsets from 5 to 20 frames). We use the public evaluation package VSUMM to evaluate the system-generated summary frames and again compute Precision, Recall and F-score [24]. More details are in the supplementary material.


We extract from each frame a color histogram and SIFT-based Fisher vector [25, 26] to model pairwise frame similarity $S_{ij}$. The two features are combined via our multiple kernel representation. To model the quality of each frame, we extract both intra-frame and inter-frame representativeness features. They are computed on the saliency maps [27, 28] and include the mean, standard deviation, median, and quantiles of the maps as well as the visual similarities between a frame and its neighbors. We z-score them within each video sequence.


Table 2 compares several methods for selecting key frames: an unsupervised clustering method VSUMM [24] (we implemented its two variants, offering a degree of tradeoff between precision and recall, and finely tuned the parameters), MLE with a multiple kernel parameterization of $S$, and our margin-based approach. For our method, we illustrate its flexibility to target different operating points, by varying the tradeoff constant $\omega$ in the generalized Hamming distance loss function eq. (11). Recall that higher values of $\omega$ will promote higher recall, while lower values promote higher precision.

The results clearly demonstrate the advantage of our approach, particularly in how it offers finer control of the tradeoff between precision and recall. By adjusting $\omega$, our method performs the best in each of the three metrics and outperforms the baselines by a statistically significant margin measured in the standard errors. Controlling the tradeoff is quite valuable in this application; for example, high precision may be preferable to a user summarizing a video he himself captured (he knows what appeared in the video, and wants a noise-free summary), whereas high recall may be preferable to a user summarizing a video taken by a third party (he has not seen the original video, and prefers some noise to dropped frames). More detailed analysis, including exemplar video frames, is provided in the supplementary material.

6 Conclusion

The determinantal point process (DPP) offers a powerful and probabilistically grounded approach for selecting diverse subsets. We proposed a novel technique for learning DPPs from annotated data. In contrast to the status quo of maximum likelihood estimation, our method is more flexible in modeling pairwise similarity and avoids the pitfall of model mis-specification. Empirical results demonstrate its advantages on both synthetic datasets and challenging real-world summarization applications.


  • [1] Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends® in Machine Learning, 5(2-3):123–286, 2012.
  • [2] Robert Burton and Robin Pemantle. Local characteristics, entropy and limit theorems for spanning trees and domino tilings via transfer-impedances. The Annals of Probability, pages 1329–1371, 1993.
  • [3] Alex Kulesza and Ben Taskar. Learning determinantal point processes. In UAI, 2011.
  • [4] Alex Kulesza and Ben Taskar. k-dpps: Fixed-size determinantal point processes. In ICML, 2011.
  • [5] Byungkon Kang. Fast determinantal point process sampling with application to clustering. In NIPS, 2013.
  • [6] A. Kulesza and B. Taskar. Structured determinantal point processes. In NIPS, 2011.
  • [7] R. H. Affandi, A. Kulesza, and E. B. Fox. Markov determinantal point processes. In UAI, 2012.
  • [8] Raja Hafiz Affandi, Emily B Fox, and Ben Taskar. Approximate inference in continuous determinantal point processes. In NIPS, 2013.
  • [9] Vladimir Vapnik. Statistical learning theory. Wiley, New York, 1998.
  • [10] Ben Taskar, Vassil Chatalbashev, Daphne Koller, and Carlos Guestrin. Learning structured prediction models: A large margin approach. In ICML, 2005.
  • [11] Odile Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.
  • [12] Chun-Wa Ko, Jon Lee, and Maurice Queyranne. An exact algorithm for maximum entropy sampling. Operations Research, 43(4):684–691, 1995.
  • [13] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Near-optimal map inference for determinantal point processes. In NIPS, 2012.
  • [14] Hui Lin and Jeff Bilmes. Multi-document summarization via budgeted maximization of submodular functions. In NAACL/HLT, 2010.
  • [15] Andrew Y Ng and Michael I Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive bayes. In NIPS, 2002.
  • [16] Tony Jebara. Machine learning: discriminative and generative. Springer, 2004.
  • [17] Raja Hafiz Affandi, Emily B. Fox, Ryan P. Adams, and Ben Taskar. Learning the parameters of determinantal point process kernels. In ICML, 2014.
  • [18] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support vector machine learning for interdependent and structured output spaces. In ICML, 2004.
  • [19] Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-margin Markov networks. In NIPS, 2004.
  • [20] Fei Sha and Lawrence K Saul. Large margin hidden Markov models for automatic speech recognition. In NIPS, 2006.
  • [21] Hoa Trang Dang. Overview of DUC 2005. In Document Understanding Conf., 2005.
  • [22] Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out: Proc. of the ACL-04 Workshop, 2004.
  • [23] Günes Erkan and Dragomir R Radev. LexRank: Graph-based lexical centrality as salience in text summarization. JAIR, 22(1):457–479, 2004.
  • [24] Sandra Eliza Fontes de Avila, Ana Paula Brandão Lopes, et al. VSUMM: A mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters, 32(1):56–68, 2011.
  • [25] David G Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
  • [26] Florent Perronnin and Christopher Dance. Fisher kernels on visual vocabularies for image categorization. In CVPR, 2007.
  • [27] Esa Rahtu, Juho Kannala, Mikko Salo, and Janne Heikkilä. Segmenting salient objects from images and videos. In ECCV, 2010.
  • [28] Xiaodi Hou, Jonathan Harel, and Christof Koch. Image signature: Highlighting sparse salient regions. T-PAMI, 34(1):194–201, 2012.
  • [29] William H Beyer. CRC standard mathematical tables and formulae. CRC press, 1991.
  • [30] Vaibhava Goel and William J Byrne. Minimum Bayes-risk automatic speech recognition. Computer Speech & Language, 14(2):115–135, 2000.


Appendix A Calculating the softmax (cf. eq. (13))

In the main text, we use softmax to deal with the exponential number of large-margin constraints and arrive at eq. (13). Here we show how to calculate the right-hand side of eq. (13).

Firstly, we compute as follows


where $K_{ii}$ is the marginal probability of selecting item $i$. Now we are ready to see


Moreover, recall that $K = L(L + I)^{-1}$. Eigen-decomposing $L$, we have
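As a sanity check on this computation, the following minimal sketch assumes that the generalized Hamming loss has the form $\ell_\omega(y^\star, y) = \sum_{i \in y} \mathbb{1}[i \notin y^\star] + \omega \sum_{i \notin y} \mathbb{1}[i \in y^\star]$ (our reading of Section 3.2 of the main text) and that $K_{ii} = \sum_m \frac{\lambda_m}{1 + \lambda_m} v_{mi}^2$ for the eigen-pairs $(\lambda_m, v_m)$ of $L$. Under these assumptions, the sum over all subsets collapses to $\sum_{i \notin y^\star} K_{ii} + \omega \sum_{i \in y^\star} (1 - K_{ii})$. The function names are illustrative and not taken from the authors' code; the sketch compares a brute-force enumeration with the closed form on a small ground set.

```python
import itertools
import numpy as np

def marginal_diagonal(L):
    """Diagonal of K = L (L + I)^{-1}, computed from the eigen-decomposition of L."""
    eigvals, eigvecs = np.linalg.eigh(L)      # L = V diag(lambda) V^T
    weights = eigvals / (1.0 + eigvals)       # lambda_m / (1 + lambda_m)
    return (eigvecs ** 2) @ weights           # K_ii = sum_m weights_m * V[i, m]^2

def hamming_loss(target, subset, omega, n):
    """Assumed form of the generalized Hamming loss (see the lead-in)."""
    false_pos = sum(1 for i in subset if i not in target)
    false_neg = sum(1 for i in range(n) if i in target and i not in subset)
    return false_pos + omega * false_neg

def expected_loss_bruteforce(L, target, omega):
    """Sum of loss(y) * P(y; L) over all 2^n subsets y (small n only)."""
    n = L.shape[0]
    normalizer = np.linalg.det(L + np.eye(n))
    total = 0.0
    for size in range(n + 1):
        for subset in itertools.combinations(range(n), size):
            idx = list(subset)
            det_sub = np.linalg.det(L[np.ix_(idx, idx)]) if idx else 1.0
            total += hamming_loss(target, subset, omega, n) * det_sub / normalizer
    return total

def expected_loss_closed_form(L, target, omega):
    """Same quantity, using only the marginals K_ii."""
    k_diag = marginal_diagonal(L)
    in_target = np.zeros(L.shape[0], dtype=bool)
    in_target[list(target)] = True
    return k_diag[~in_target].sum() + omega * (1.0 - k_diag[in_target]).sum()

# Small demonstration on a random positive definite kernel.
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
L_demo = B @ B.T + 0.1 * np.eye(5)
print(expected_loss_bruteforce(L_demo, {0, 2}, 0.5))
print(expected_loss_closed_form(L_demo, {0, 2}, 0.5))
```

The closed form needs only one eigen-decomposition, whereas the enumeration grows exponentially with the size of the ground set.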


Appendix B Subgradients of the objective function (cf. eq. (14))

Recall that our objective function in eq. (14) consists of a likelihood term and a second term that accounts for the undesirable subsets. Denote them respectively by


For brevity, we drop the subscript of and and change to in what follows.

To compute the overall subgradients, it is sufficient to compute the gradients of the above two terms, and . Denoting by , we have



Here $\circ$ stands for the element-wise product between two matrices of the same size. We deliberately use the chain rule to separate the derivative of the kernel with respect to its parameters from the overall gradients. Therefore, if we change the way the DPP kernel $L$ is parameterized, only this factor needs to be recomputed when we compute the gradients for the new parameterization.

b.1 Gradients of the quality-diversity decomposition

In terms of the quality-diversity decomposition (cf. eqs. (7) and (8)), we have


where $q$ is the vector concatenating the quality terms $q_i$, $X$ is the design matrix concatenating $x_i^\top$ row by row, and $e_i$ stands for the standard unit vector with 1 at the $i$-th entry and 0 elsewhere.
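To illustrate how the chain rule separates the parameterization from the rest of the gradient, here is a minimal sketch. It assumes a common quality-diversity form, $L_{ij} = q_i S_{ij} q_j$ with $q_i = \exp(\tfrac{1}{2} \theta^\top x_i)$ and a fixed similarity matrix $S$; this is an assumption for illustration and may differ in detail from eqs. (7) and (8). Under it, $\partial L_{ij} / \partial \theta = \tfrac{1}{2} L_{ij} (x_i + x_j)$. All function names are hypothetical.

```python
import numpy as np

def build_kernel(theta, X, S):
    """Assumed parameterization: L_ij = q_i * S_ij * q_j, q_i = exp(0.5 * theta^T x_i)."""
    q = np.exp(0.5 * X @ theta)
    return q[:, None] * S * q[None, :]

def grad_wrt_theta(theta, X, S, grad_wrt_L):
    """Chain rule: sum over all (i, j) of (df/dL_ij) * (dL_ij/dtheta)."""
    L = build_kernel(theta, X, S)
    W = grad_wrt_L * L                        # element-wise product
    # dL_ij/dtheta = 0.5 * L_ij * (x_i + x_j), summed over all pairs
    return 0.5 * X.T @ (W.sum(axis=1) + W.sum(axis=0))

def log_normalizer(L):
    """Example function of the kernel: log det(L + I)."""
    return np.linalg.slogdet(L + np.eye(L.shape[0]))[1]

def grad_log_normalizer(L):
    """Gradient of log det(L + I) with respect to L (L symmetric)."""
    return np.linalg.inv(L + np.eye(L.shape[0]))

# Demonstration: gradient of log det(L + I) with respect to theta.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(6, 3))              # one feature row per item
F_demo = rng.normal(size=(6, 4))
S_demo = F_demo @ F_demo.T                    # positive semidefinite similarity
theta_demo = 0.3 * rng.normal(size=3)
L_demo = build_kernel(theta_demo, X_demo, S_demo)
print(grad_wrt_theta(theta_demo, X_demo, S_demo, grad_log_normalizer(L_demo)))
```

Only `build_kernel` and the line computing $\partial L_{ij} / \partial \theta$ depend on the parameterization; the gradient with respect to the kernel is passed in unchanged.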

b.2 Gradients with respect to the DPP kernel

In what follows we calculate the gradients of the two terms with respect to the kernel $L$, as needed in eq. (26). Because eq. (26) sums over all the $(i, j)$ pairs, we do not need to take special care of the symmetric structure in $L$.

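We will need to map $L_{y^\star}$ “back” to a matrix $M$ of the same size as the original matrix $L$, such that $M_{y^\star} = L_{y^\star}$ and all the other entries of $M$ are zero. We denote this mapping by $\langle L_{y^\star} \rangle$, i.e., $\langle L_{y^\star} \rangle = M$. Now we are ready to see,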


It is a little more involved to compute


which involves the derivatives of the marginal probabilities $K_{kk}$ with respect to the entries of $L$.

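In order to calculate $\partial K / \partial L_{ij}$, we start from the basic identity [29] for an invertible matrix $B$ that depends on a scalar $t$,

$$\frac{\partial B^{-1}}{\partial t} = -B^{-1} \, \frac{\partial B}{\partial t} \, B^{-1},$$

followed by $\partial (L + I) / \partial L_{ij} = J^{ij}$, where $J^{ij}$ is the same size as $L$. The $(i,j)$-th entry of $J^{ij}$ is 1 and all other entries are zero.

Let $M = (L + I)^{-1}$. Noting that $K = L(L + I)^{-1} = I - M$ and thus $\partial K / \partial L_{ij} = -\partial M / \partial L_{ij} = M J^{ij} M$, we have,

$$\frac{\partial K_{kk}}{\partial L_{ij}} = M_{ki} M_{jk}. \tag{31}$$

We can also write eq. (31) in matrix form,

$$\frac{\partial K_{kk}}{\partial L} = m_k m_k^\top,$$

where $m_k$ is the $k$-th column of $M$.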

Overall, we arrive at a concise form by writing out the right-hand side of eq. (29) and merging some terms,


where $I_\omega(y^\star)$ looks like an identity matrix except that its $(i,i)$-th entry is $-\omega$ for $i \in y^\star$.
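Both kernel gradients of this section can be checked numerically. The sketch below assumes the standard L-ensemble likelihood $\log P(y^\star; L) = \log\det(L_{y^\star}) - \log\det(L + I)$ and assumes that the second term equals $\log\big(\sum_{i \notin y^\star} K_{ii} + \omega \sum_{i \in y^\star} (1 - K_{ii})\big)$, which is our reading of Appendix A. Under these assumptions the gradients are $\langle (L_{y^\star})^{-1} \rangle - (L + I)^{-1}$ and $(L + I)^{-1} I_\omega(y^\star) (L + I)^{-1}$ divided by the argument of the logarithm. The function names are illustrative.

```python
import numpy as np

def embed(sub, idx, n):
    """Map a |idx|-by-|idx| matrix back into an n-by-n matrix of zeros."""
    out = np.zeros((n, n))
    out[np.ix_(idx, idx)] = sub
    return out

def log_likelihood(L, target):
    idx, n = list(target), L.shape[0]
    return (np.linalg.slogdet(L[np.ix_(idx, idx)])[1]
            - np.linalg.slogdet(L + np.eye(n))[1])

def grad_log_likelihood(L, target):
    idx, n = list(target), L.shape[0]
    inv_sub = np.linalg.inv(L[np.ix_(idx, idx)])
    return embed(inv_sub, idx, n) - np.linalg.inv(L + np.eye(n))

def softmax_argument(L, target, omega):
    """sum_{i not in target} K_ii + omega * sum_{i in target} (1 - K_ii)."""
    n = L.shape[0]
    k_diag = 1.0 - np.diag(np.linalg.inv(L + np.eye(n)))   # K = I - (L + I)^{-1}
    mask = np.zeros(n, dtype=bool)
    mask[list(target)] = True
    return k_diag[~mask].sum() + omega * (1.0 - k_diag[mask]).sum()

def softmax_term(L, target, omega):
    return np.log(softmax_argument(L, target, omega))

def grad_softmax_term(L, target, omega):
    n = L.shape[0]
    M = np.linalg.inv(L + np.eye(n))
    mask = np.zeros(n, dtype=bool)
    mask[list(target)] = True
    diag_weights = np.where(mask, -omega, 1.0)              # diagonal of I_omega
    return (M * diag_weights) @ M / softmax_argument(L, target, omega)

def numeric_grad(f, L, eps=1e-6):
    """Central finite differences, perturbing one entry L_ij at a time."""
    grad = np.zeros_like(L)
    for i in range(L.shape[0]):
        for j in range(L.shape[1]):
            step = np.zeros_like(L)
            step[i, j] = eps
            grad[i, j] = (f(L + step) - f(L - step)) / (2 * eps)
    return grad

# Demonstration: largest deviation between analytic and numeric gradients.
rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
L_demo = B @ B.T + 0.5 * np.eye(4)
target_demo = [0, 2]
print(np.abs(grad_log_likelihood(L_demo, target_demo)
             - numeric_grad(lambda A: log_likelihood(A, target_demo), L_demo)).max())
print(np.abs(grad_softmax_term(L_demo, target_demo, 0.5)
             - numeric_grad(lambda A: softmax_term(A, target_demo, 0.5), L_demo)).max())
```

Each entry of $L$ is perturbed independently, which matches the remark above that the symmetric structure of $L$ needs no special treatment when summing over all pairs.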

Appendix C Minimum Bayes Risk decoding

We conduct MAP inference for the DPP by brute-force search on the synthetic data, and turn to the so-called minimum Bayes risk (MBR) decoding [30, 1] for larger ground sets on real data.

The MBR inference samples subsets from the learned DPP and outputs the one that achieves the highest consensus with the others, where the consensus can be measured by different evaluation metrics depending on the application. We use the F-score in our case. In particular,


Note that the MBR inference introduces a degree of flexibility to DPPs (and to other probabilistic models). It allows users to infer the desired output according to different evaluation metrics. As a result, the selected subset is not necessarily the “true” diverse subset, but is biased towards the users’ specific interests.
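The MBR selection rule can be sketched in a few lines. The sketch below assumes that the subsets have already been sampled from the learned DPP (the sampler itself is out of scope here) and that consensus is the summed F-score of a candidate against all the other samples. The function names are illustrative, and excluding a sample's comparison with itself is a choice made for this sketch.

```python
def f_score(candidate, reference):
    """F-score between two subsets of item indices."""
    candidate, reference = set(candidate), set(reference)
    overlap = len(candidate & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def mbr_decode(samples, similarity=f_score):
    """Return the sampled subset with the highest consensus with the others."""
    best_subset, best_score = None, float("-inf")
    for r, candidate in enumerate(samples):
        consensus = sum(similarity(candidate, other)
                        for s, other in enumerate(samples) if s != r)
        if consensus > best_score:
            best_subset, best_score = candidate, consensus
    return best_subset

# Demonstration with hand-written "samples" standing in for DPP draws.
samples_demo = [{1, 2}, {1, 2, 3}, {1, 2}, {7}]
print(mbr_decode(samples_demo))
```

Passing a different `similarity` function changes the metric that the decoded subset is biased toward, which is the flexibility discussed above.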

Appendix D Video summarization

We provide details on 1) how to generate oracle summaries as the supervised information to learn DPPs and 2) how to evaluate system-generated summaries against user summaries. We also present more results on balancing the precision and recall through our large-margin DPP.

d.1 Oracle summary

In the OVP dataset, each video comes with five user summaries [24]. Similar to document summarization [3], we extract an “oracle” summary from the five user summaries using a greedy algorithm. Initialize the oracle summary as the empty set. From the frames not yet in the oracle summary, we pick the one that contributes the largest marginal gain,


where vsumm is the package developed in [24] to evaluate video summarization results. We defer the description of vsumm's evaluation scheme to Section D.2. In short, we select the oracle frames greedily for each video and stop when the marginal gain becomes negative. We evaluate the oracle summaries against the users' summaries and find that they achieve high precision and recall, 84.1% and 87.7% respectively, validating that the oracle summaries can serve as good supervised targets for training DPP models.
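The greedy procedure can be sketched as follows. Since the vsumm package is not reproduced here, the sketch takes an arbitrary scoring function as input; the toy score used in the demonstration (mean F-score against the user summaries, with frames matched by exact identity) is a stand-in assumption, and all names are hypothetical.

```python
def f_score(selected, reference):
    """F-score with exact frame identity as a stand-in for vsumm matching."""
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(selected)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def mean_user_score(selected, user_summaries):
    """Average agreement of a candidate summary with all user summaries."""
    return sum(f_score(selected, u) for u in user_summaries) / len(user_summaries)

def greedy_oracle(frames, score):
    """Add the frame with the largest marginal gain until the gain turns negative."""
    oracle = []
    current = score(oracle)
    remaining = list(frames)
    while remaining:
        gain, frame = max(((score(oracle + [f]) - current, f) for f in remaining),
                          key=lambda pair: pair[0])
        if gain < 0:                          # stopping rule described in the text
            break
        oracle.append(frame)
        remaining.remove(frame)
        current = score(oracle)
    return oracle

# Demonstration with three toy user summaries over ten frames.
users_demo = [{1, 4, 7}, {1, 4}, {1, 7, 9}]
print(greedy_oracle(range(10), lambda s: mean_user_score(s, users_demo)))
```

Each iteration re-scores every remaining frame, so the cost is quadratic in the number of frames times the cost of one evaluation.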

The above procedure allows a “user-independent” definition of a good oracle summary for learning. Of course, if the application goal were to generate user-specific summaries catering to a particular user’s taste, one would instead simply apply our framework with the target subset set to that particular user’s selection.

d.2 vsumm: evaluating video summarization results

We evaluate video summarization results using the vsumm package [24]. Given two sets of summaries/frames, it searches for the maximum number of matched pairs of frames between them. Two images are viewed as a matched pair if their visual difference is below a certain threshold; vsumm uses normalized color histograms to compute this difference. In addition, each frame of one set can be matched to at most one frame of the other set, and vice versa. After the matching procedure, one can define different evaluation metrics based on the number of matched pairs. In our experiments, we define F-score, precision, and recall (cf. eq. (15) of the main text).
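The matching-based evaluation can be sketched as a maximum bipartite matching. This is not the vsumm implementation: the half-L1 histogram difference, the threshold value passed by the caller, and the definitions precision = matches / system frames and recall = matches / user frames are assumptions made for this sketch, and the function names are hypothetical.

```python
import numpy as np

def max_matching(allowed):
    """Size of a maximum one-to-one matching (augmenting-path algorithm).

    allowed[i, j] is True if frame i of the first set may match frame j of the second.
    """
    n_a, n_b = allowed.shape
    match_of_b = [-1] * n_b                   # index of the frame matched to j, or -1

    def try_assign(i, seen):
        for j in range(n_b):
            if allowed[i, j] and not seen[j]:
                seen[j] = True
                if match_of_b[j] == -1 or try_assign(match_of_b[j], seen):
                    match_of_b[j] = i
                    return True
        return False

    return sum(try_assign(i, [False] * n_b) for i in range(n_a))

def precision_recall_f(system_hists, user_hists, threshold):
    """Scores from the number of matched frame pairs (normalized histograms as rows)."""
    if len(system_hists) == 0 or len(user_hists) == 0:
        return 0.0, 0.0, 0.0
    # Assumed visual difference: half the L1 distance between histograms.
    diff = 0.5 * np.abs(system_hists[:, None, :] - user_hists[None, :, :]).sum(axis=2)
    matched = max_matching(diff < threshold)
    precision = matched / len(system_hists)
    recall = matched / len(user_hists)
    f = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f

# Demonstration with three system frames and two user frames (3-bin histograms).
system_demo = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
user_demo = np.array([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
print(precision_recall_f(system_demo, user_demo, threshold=0.5))
```

An augmenting-path search is used instead of a first-come greedy assignment so that the count is the maximum number of one-to-one matches, as the description above requires.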

d.3 More results on balancing precision and recall

Figure 2: Balancing precision and recall. Through our large-margin DPPs, we can balance precision and recall by varying $\omega$ in the generalized Hamming distance (cf. Section 3.2 in the main text). In contrast, neither MLE nor VSUMM (the two variants in [24] are plotted together) readily supports such flexibility.
Figure 3: Video summaries generated by and our with , and , respectively. The oracle summary is also included for reference.

We present more results here on balancing precision and recall through our large-margin trained DPPs. By varying $\omega$ in the generalized Hamming distance (cf. Section 3.2 in the main text), we obtain 8 pairs of (precision, recall) values. We apply uniform interpolation among them and draw the precision-recall curve in Fig. 2. One can see that varying $\omega$ controls the characteristics of the DPP-generated summaries, biasing them toward either high precision or high recall without sacrificing the other too much. Though neither MLE nor VSUMM supplies such modeling flexibility, we also include them in the figure for reference.

In addition, Fig. 3 shows some qualitative results. For this particular video, , with , and with all give rise to high recall. Their output summaries are rather lengthy, and may bore users who just want to grasp something interesting to watch. By turning down the weight $\omega$, our large-margin DPP dramatically improves the precision to 76% (in contrast to the 48% of ).