1 Introduction
Careful selection of items from a large collection underlies many machine learning applications. Notable examples include recommender systems, information retrieval and automatic summarization methods, among others. Typically, the selected set of items must fulfill a variety of application-specific requirements: for example, when recommending items to a user, the quality of each selected item is important. This quality must, however, be balanced by the diversity of the selected items to avoid redundancy within recommendations.

But balancing quality with diversity is challenging: as the collection size grows, the number of its subsets grows exponentially. A model that offers an elegant, tractable way to achieve this balance is a Determinantal Point Process (Dpp). Concretely, a Dpp models a distribution $\mathcal{P}$ over subsets of a ground set $\mathcal{Y}$ that is parametrized by a semidefinite matrix $L$, such that for any $A \subseteq \mathcal{Y}$,
$$\mathcal{P}(A) \propto \det(L_A), \qquad (1)$$
where $L_A$ is the submatrix of $L$ indexed by $A$. Informally, $\det(L_A)$ represents the volume associated with subset $A$, the diagonal entry $L_{ii}$ represents the importance of item $i$, while entry $L_{ij}$ encodes similarity between items $i$ and $j$. Since the normalization constant of (1) is simply $\sum_{A \subseteq \mathcal{Y}} \det(L_A) = \det(L + I)$, we have $\mathcal{P}(A) = \det(L_A) / \det(L + I)$, which suggests why Dpps may be tractable despite their exponentially large sample space.
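As a small worked illustration of the definition above, the following sketch computes subset probabilities as $\det(L_A)/\det(L+I)$ for a hand-made 3-item kernel. The kernel values and the helper names (`det`, `dpp_prob`) are assumptions chosen for illustration; the cofactor-expansion determinant is only suitable for tiny matrices.

```python
from itertools import combinations

def det(m):
    """Determinant by cofactor expansion (tiny matrices only); det of the empty matrix is 1."""
    if not m:
        return 1.0
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def dpp_prob(L, A):
    """P(A) = det(L_A) / det(L + I) for a subset A of item indices."""
    idx = sorted(A)
    L_A = [[L[i][j] for j in idx] for i in idx]
    n = len(L)
    L_plus_I = [[L[i][j] + (i == j) for j in range(n)] for i in range(n)]
    return det(L_A) / det(L_plus_I)

# Hypothetical kernel: items 0 and 1 are near-duplicates (similarity 0.9), item 2 is unrelated.
L = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

p_similar = dpp_prob(L, {0, 1})   # redundant pair
p_diverse = dpp_prob(L, {0, 2})   # diverse pair
# Summing over all 2^3 subsets checks the normalization constant det(L + I).
total = sum(dpp_prob(L, set(c)) for k in range(4) for c in combinations(range(3), k))
print(p_similar, p_diverse, total)
```

In this toy kernel the diverse pair receives a higher probability than the redundant pair, and the probabilities of all subsets sum to one.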
The key object defining a Dpp is its kernel matrix $L$. This matrix may be fixed a priori using domain knowledge (Borodin, 2009), or, as is more common in machine learning applications, learned from observations using maximum likelihood estimation (MLE) (Gillenwater et al., 2014; Mariet & Sra, 2015). However, while fitting observed subsets well, MLE for Dpps may also assign high likelihoods to unobserved subsets far from the underlying generative distribution (Chao et al., 2015). MLE-based Dpp models may thus have modes corresponding to subsets that are close in likelihood, yet differ in how close they are to the true data distribution. Such confusable modes reduce the quality of the learned model, hurting predictions (see Figure 1).
Such concerns when learning generative models over huge sample spaces are not limited to the area of subset selection: applications in image and text generation have been the driving force in developing techniques for generating high-quality samples. Among their innovations, a particularly successful technique uses generated samples as “negative samples” to train a discriminator, which in turn encourages generation of more realistic samples; this is the key idea behind the Generative Adversarial Nets (GANs) introduced in (Goodfellow et al., 2014).

These observations motivate us to investigate the use of Dpp-generated samples with added perturbations as negatives, which we then incorporate into the learning task to improve the modeling power of Dpps. Intuitively, negative samples are those subsets that are far from the true data distribution, but to which the Dpp erroneously assigns high probability. As there is no closed-form way to generate such idealized negatives, we approximate them via an external “negative distribution”.
More precisely, we introduce a novel Dpp learning problem that incorporates samples from a negative distribution into traditional MLE. While the focus of our work is on generating the negative distribution jointly with $L$, we also investigate outside sources of negative information. Ultimately, our formulation leads to an optimization problem harder than the original Dpp learning problem; we show that even approximate solutions greatly improve the performance of the Dpp model when evaluated on concrete tasks, such as identifying the best item to add to a subset of chosen objects (basket completion) and discriminating between held-out test data and randomly generated subsets.
Figure 1 (caption). [...] model can assign relatively high predictive probabilities to modes that represent incorrect predictions, resulting in higher symmetric KL divergence compared to the true empirical distribution and a high variance, revealing a high sensitivity to initialization. The Dyn and Exp methods we introduce in Section 3 reduce this confusable-mode issue, resulting in predictive distributions that are closer to the true distribution and much smaller variances.

Contributions.
To our knowledge, this work is the first theoretical or empirical investigation of augmenting the Dpp learning problem with negative information.

Our first main contribution is the Contrastive Estimation (CE) model, which incorporates negative information through inferred negatives into the learning task.

We introduce static and dynamic models for CE and discuss the theoretical and practical tradeoffs of such choices. Static models leverage information that does not evolve over time, whereas dynamic models draw samples from a negative distribution that depends on the current model’s parameters; dynamic CE posits an optimization problem worthy of independent study.

We show how to learn CE models efficiently, and furthermore show that the complexity of conditioning a Dpp on a chosen sample can be brought from to essentially . This helps dynamic CE and removes a major bottleneck in computing next-item predictions for a set.

Using findings obtained from extensive experiments conducted on small datasets, we show on a large dataset that CE learning significantly improves the modeling power of Dpps: CE learning improves Dpp performance for next-item basket completion, as well as Dpp discriminative power, as evaluated by the model’s ability to distinguish held-out test data from randomly generated subsets.
We present a review of related work in Section 2. In Section 3, we introduce Contrastive Estimation and its dynamic and static variants. In Section 4, we discuss how the CE problem can be optimized efficiently, as well as how Dpp conditioning for basket-completion predictions can be performed with improved complexity. In Section 5, we show that CE learning leads to remarkable empirical improvements of Dpp performance metrics.
2 Background and related work
First introduced to model fermion behavior by Macchi (1975), Dpps have gained popularity due to their elegant balancing of quality and subset diversity. Dpps are studied both for their theoretical properties (Kulesza & Taskar, 2012; Borodin, 2009; Affandi et al., 2014; Kulesza, 2013; Gillenwater, 2014; Decreusefond et al., 2015; Lavancier et al., 2015) and their machine learning applications: object retrieval (Affandi et al., 2014), summarization (Lin & Bilmes, 2012; Chao et al., 2015), sensor placement (Krause et al., 2008), recommender systems (Gartrell et al., 2016), neural network compression (Mariet & Sra, 2016a), and minibatch selection (Zhang et al., 2017).

Gillenwater et al. (2014) study Dpp kernel learning via EM, while Mariet & Sra (2015) present a fixed-point method. Dpp kernel learning has leveraged Kronecker (Mariet & Sra, 2016b) and low-rank (Dupuy & Bach, 2016; Gartrell et al., 2017) structures. Learning guarantees using Dpp graph properties are studied in (Urschel et al., 2017). Aside from Tschiatschek et al. (2016) and Djolonga et al. (2016), who learn a Facility LocatIon Diversity (FLID) distribution (as well as more complex FLIC and FLDC models) by contrasting it with a “negative” product distribution, little attention has been given to using negative samples to learn richer subset-selection models.
Nonetheless, leveraging negative information is widely used in other applications. In object detection, negative mining corrects for the skewed simple-to-difficult negative distribution by training the model on its false positives (Sung, 1996; Canévet & Fleuret, 2014; Shrivastava et al., 2016). In language modeling, Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2012), which tasks the model with distinguishing positive samples from generated negatives, was first applied in (Mnih & Teh, 2012) and has been instrumental in Word2Vec (Mikolov et al., 2013). Since then, variants using adaptive noise (Chen et al., 2017) have been introduced. NCE is also the method used by Tschiatschek et al. (2016) for subset selection.

An alternate approach to negative samples within submodular language models was introduced as Contrastive Estimation in (Smith & Eisner, 2005a, b). Negative sampling is also used in GANs (Goodfellow et al., 2014), where a generator network competes with a discriminative network which distinguishes between positives and generated negatives. An adversarial approach to Contrastive Estimation has been recently introduced in (Bose et al., 2018), where ideas from GANs for discrete data are used to implement an adversarial negative sampler that augments a conventional negative sampler.
3 Learning Dpps with negative samples
Motivated by the similarities between Dpp learning and crucial structured prediction problems in other ML fields, we introduce an optimization problem that leverages negative information. We refer to this problem as Contrastive Estimation (CE) due to its ties to a notion discussed in (Smith & Eisner, 2005a).
3.1 Contrastive Estimation
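In conventional Dpp learning, we seek to maximize determinantal volumes of sets drawn from the true distribution (that we wish to model), by solving the following MLE problem, where samples in the training set $\mathcal{A}^+$ are assumed to be drawn i.i.d.:
$$\text{Find } L \in \operatorname*{argmax}_{L \succeq 0} \phi_{\mathrm{MLE}}(L), \qquad \phi_{\mathrm{MLE}}(L) = \frac{1}{|\mathcal{A}^+|} \sum_{A \in \mathcal{A}^+} \log \det(L_A) - \log \det(L + I). \qquad (2)$$
We augment problem (2) to incorporate additional information from a negative distribution $\nu$, which we wish to have the Dpp distribution move away from. The ensuing optimization problem is the main focus of our paper.
Definition 1 (Contrastive Estimation).
Given a training set of positive samples $\mathcal{A}^+$ on which $\phi_{\mathrm{MLE}}$ is defined and a negative distribution $\nu$ over $2^{\mathcal{Y}}$, we call Contrastive Estimation the problem
$$\text{Find } L \in \operatorname*{argmax}_{L \succeq 0} \phi_{\mathrm{CE}}(L), \qquad \phi_{\mathrm{CE}}(L) = \phi_{\mathrm{MLE}}(L) - \mathbb{E}_{A \sim \nu}\left[\log \mathcal{P}_L(A)\right], \qquad (3)$$
where we write $\mathcal{P}_L(A) = \det(L_A) / \det(L + I)$.
The expectation can be approximated by drawing a set of samples $\mathcal{A}^-$ from $\nu$; then $\phi_{\mathrm{CE}}$ becomes (footnote 1)
$$\phi_{\mathrm{CE}}(L) = \frac{1}{|\mathcal{A}^+|} \sum_{A \in \mathcal{A}^+} \log \mathcal{P}_L(A) - \frac{1}{|\mathcal{A}^-|} \sum_{A \in \mathcal{A}^-} \log \mathcal{P}_L(A). \qquad (4)$$
Footnote 1: With a slight abuse of notation, we continue writing $\phi_{\mathrm{CE}}$ despite the sample approximation to $\nu$.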
If , the CE objective (3) reduces to . Conversely, can be viewed as a sample-based approximation of the value , where is the true distribution generating the samples in . Interestingly, another reformulation of (3) suggests an even broader class of Dpp kernel learning: indeed, let be (resp. ) for (resp. ), and define
where the should be viewed as belonging in with an additional normalization coefficient. Then, we can rewrite equation (4) in the following form
(5) 
Formulation (5) suggests the use of a broader scope of continuous labels ; we do not cover this variation in the present work, but note that (5) permits the use of weighted samples for learning.
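As a rough illustration of the sample-based objective, the sketch below evaluates the mean log-likelihood of a set of positives minus the mean log-likelihood of a set of negatives. To stay dependency-free it assumes a diagonal kernel $L = \mathrm{diag}(q)$, for which $\det(L_A) = \prod_{i \in A} q_i$ and $\det(L + I) = \prod_i (1 + q_i)$. The function names, the toy sets and the diagonal simplification are illustrative assumptions, not the implementation used in this work, which relies on a low-rank kernel.

```python
import math

def log_prob_diag(q, A):
    """log P_L(A) for a diagonal kernel L = diag(q) (simplifying assumption):
    sum_{i in A} log q_i  -  sum_i log(1 + q_i)."""
    return sum(math.log(q[i]) for i in A) - sum(math.log(1.0 + qi) for qi in q)

def ce_objective(q, positives, negatives):
    """Mean log-likelihood of the positives minus mean log-likelihood of the negatives."""
    pos = sum(log_prob_diag(q, A) for A in positives) / len(positives)
    neg = sum(log_prob_diag(q, A) for A in negatives) / len(negatives)
    return pos - neg

# Hypothetical toy data over 3 items.
positives = [{0, 1}, {0}, {0, 1}]
negatives = [{2}, {1, 2}]

q_uniform = [1.0, 1.0, 1.0]
q_tuned = [2.0, 1.0, 0.25]   # raises the weight of item 0, lowers the weight of item 2

obj_uniform = ce_objective(q_uniform, positives, negatives)
obj_tuned = ce_objective(q_tuned, positives, negatives)
print(obj_uniform, obj_tuned)
```

Under these assumptions the second kernel scores higher than the uniform one, because it raises the weight of an item that appears only in positives and lowers the weight of an item that appears only in negatives.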
Remark 1.
Compared to the traditional Noise Contrastive Estimation (NCE) approach, which requires full knowledge of the negative distribution, CE does not suffer any such limitation: we only require an estimate of .
Remark 2.
Remark 3.
CE is a nonconvex optimization problem, and thus admits the same guarantees as Dpp MLE learning when learned using Stochastic Gradient Ascent with decreasing step sizes; however, the convergence rate will depend on the choice of .
Indeed, to fully specify the CE problem one must first choose the negative distribution , or equivalently, choose a procedure to generate negative samples to obtain (4). We consider below two classes of distributions with considerably different ramifications: dynamic and static negatives; their analysis is the focus of the next two sections.
3.2 Dynamic negatives
In most applications leveraging negative information (e.g., negative mining, GANs), negative samples evolve over time based on the state of the learned model. We call any that depends on the state of the model a dynamic negative distribution: at iteration of the learning procedure with kernel estimate , we use a parametrized by .
More specifically, we focus on the setting where negative samples themselves are generated by the current Dpp, with the goal of reducing overfitting. Given a positive sample , we generate a negative by replacing with that yields a high probability (Alg. 1). We generate the samples probabilistically rather than via mode maximization so that a sample can lead to different negatives when we generate more negatives than positives.
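The swap-based generation described above can be sketched as follows. Since Alg. 1 is not reproduced here, this is one plausible reading rather than the exact procedure: a uniformly chosen item of the positive is removed, and a replacement outside the positive is drawn with probability proportional to the determinant of the resulting subset's kernel submatrix. The helper names and the toy kernel are hypothetical.

```python
import random

def det(m):
    """Determinant by cofactor expansion (tiny matrices only); det of the empty matrix is 1."""
    if not m:
        return 1.0
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def subset_det(L, A):
    """det(L_A): unnormalized Dpp weight of subset A."""
    idx = sorted(A)
    return det([[L[i][j] for j in idx] for i in idx])

def dynamic_negative(L, positive, rng):
    """Swap one item of `positive` for an outside item j, drawn with probability
    proportional to det(L_{B + {j}}), where B is the positive minus the removed item."""
    removed = rng.choice(sorted(positive))
    base = set(positive) - {removed}
    candidates = [j for j in range(len(L)) if j not in positive]
    # Clamp at zero to guard against tiny negative values from rounding.
    weights = [max(subset_det(L, base | {j}), 0.0) for j in candidates]
    chosen = rng.choices(candidates, weights=weights, k=1)[0]
    return base | {chosen}

# Hypothetical 4-item kernel with two pairs of similar items.
L = [[1.0, 0.9, 0.0, 0.0],
     [0.9, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.5],
     [0.0, 0.0, 0.5, 1.0]]

negative = dynamic_negative(L, {0, 2}, random.Random(0))
print(negative)
```

Sampling the replacement, rather than always taking the highest-weight candidate, means that the same positive can yield different negatives across calls.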
As evolves along with , the second term of acts as a moving target that must be continuously estimated during the learning procedure. For this reason, we choose to optimize by a two-step procedure described in Alg. 2, similarly to an alternating maximization approach such as EM.
Note that this approach bears strong similarities with GANs, in which both the generator and discriminator evolve during training (dynamic negatives also appear in a discussion by Goodfellow (2014) as a theoretical tool to analyze the difference between NCE and GANs).
Once the generated negative has been used in an iteration of the optimization of , it is less likely to be sampled again (footnote 2). Crucially, such dynamic negatives also avoid the problem alluded to in Remark 2, since by construction they have a nonzero probability under at iteration .

Footnote 2: If happens to be a false negative (i.e. appears in ), will be comparatively sampled more frequently as a positive, and so will contribute on average as a positive sample. Additional precautions such as the ones mentioned in (Bose et al., 2018) can also be leveraged if necessary.
3.3 Static negatives
Conversely, we can simplify the optimization problem by considering a static negative distribution: does not depend on the current kernel estimate. A considerable theoretical advantage of static negatives lies in the simpler optimization problem: given a static negative distribution , the optimization objective does not evolve during training, and is amenable to a simple invocation of stochastic gradient descent (Bottou, 1998).

Theorem 1.
Let be a static distribution over and let be such that . Let be a bounded subspace of all positive semidefinite matrices of rank . Projected stochastic gradient ascent applied to the CE objective with negative distribution and space with step sizes such that , will converge to a critical point.
Note, however, that such distributions may suffer from the fundamental theoretical issue in Rem. 2, and hence careful attention must be paid to ensure that the learning algorithm does not converge to a spurious optimum that assigns a probability to . In practice, we observed that the local nature of stochastic gradient ascent iterations was sufficient to avoid such behavior.
Let us now discuss two classical choices for fixed .
Product negatives.
A common choice of negative distribution in other machine learning areas is the product distribution, which is the standard “noise” distribution used in NCE. It is defined by
(6) 
where $p_i$ is the empirical probability of item $i$ in the training set. Although Mikolov et al. (2013) report better results by raising the $p_i$ to the power $3/4$, we did not observe any improvements when using exponentiated power distributions; for this reason, by product negatives, we always indicate the baseline distribution (6).
The product distribution is in practice a mismatch for Dpps, as it lacks the negative association property of Dpps which enables them to model the repulsive interactions between similar items (footnote 3).

Footnote 3: Dpps belong to the family of Strongly Rayleigh measures, which have been shown to satisfy a broad range of negative association properties; we refer the interested reader to the fascinating work (Pemantle, 2000; Borcea et al., 2009a, b; Borcea & Brändén, 2009a, b, 2010).
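A minimal sketch of product negatives follows, assuming the usual independent-inclusion form in which each item enters a negative subset independently with its empirical probability. The function names and the toy training sets are hypothetical.

```python
import random
from collections import Counter

def empirical_item_probs(training_sets, n_items):
    """p_i = fraction of training subsets that contain item i."""
    counts = Counter(i for A in training_sets for i in A)
    return [counts[i] / len(training_sets) for i in range(n_items)]

def sample_product_negative(probs, rng):
    """Include each item independently with probability p_i (assumed form of the product distribution)."""
    return {i for i, p in enumerate(probs) if rng.random() < p}

# Hypothetical training subsets over a 4-item catalog.
training_sets = [{0, 1}, {0, 2}, {0, 1, 3}, {1}]
probs = empirical_item_probs(training_sets, n_items=4)

rng = random.Random(0)
negatives = [sample_product_negative(probs, rng) for _ in range(5)]
print(probs, negatives)
```

Because items are drawn independently, nothing in this sampler discourages similar items from co-occurring, which is the mismatch with Dpps noted above.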
Explicit negatives.
Alternatively, we may have prior knowledge of a class of subsets that our model should not generate. For example, we might know that items $i$ and $j$ are negatively correlated and hence unlikely to co-occur. We may also learn via user feedback that some generated subsets are inaccurate. We refer to negatives obtained using such outside information as explicit negatives.
A fundamental advantage of explicit negatives is that they allow us to incorporate prior knowledge and user feedback as part of the learning algorithm. The ability to incorporate such information, to our knowledge, is in itself a novel contribution to Dpp learning.
Although such knowledge may be costly and/or only available at rare intervals, a form of continuous learning that would regularly update the state of our prior knowledge (and hence ) would bring the explicit negative distribution into the realm of dynamic distributions, as described by Alg. 2.
4 Efficient learning and prediction
We now describe how the Contrastive Estimation problem for Dpps can be optimized efficiently. In order to efficiently generate dynamic negatives, which rely on Dpp conditioning, we additionally generalize the dual transformation leveraged in (Osogami et al., 2018) to speed up basket-completion tasks with Dpps. This speedup impacts the broader use of Dpps, outside of CE learning.
4.1 Optimizing
We propose to optimize the CE problem by exploiting a low-rank factorization of the kernel, writing $L = VV^\top$, where $V \in \mathbb{R}^{M \times K}$, $M$ is the number of items, and $K$ is the rank of the kernel, which is fixed a priori.
This factorization ensures that the estimated kernel remains positive semidefinite, and enables us to leverage the low-rank computations derived in (Gartrell et al., 2017) and refined in (Osogami et al., 2018). Given the similar forms of the MLE and CE objectives, we use the traditional stochastic gradient ascent algorithm introduced by Gartrell et al. (2017) to optimize (3). In the case of dynamic negatives, we regenerate after each gradient step; less frequent updates are also possible if the negative generation algorithm is very costly.
We furthermore augment with a regularization term , defined as
where counts the occurrences of in the training set, is the corresponding row vector of and is a tunable hyperparameter. Note that this is the same regularization as introduced in (Gartrell et al., 2017). This regularization tempers the strength of , a term interpretable as the popularity of item (Kulesza & Taskar, 2012; Gillenwater, 2014), based on its empirical popularity . Experimentally, we observe that adding has a strong impact on the predictive quality of our model.

The reader may wonder if other approaches to Dpp learning are also applicable to the CE problem.
Remark 4.
Gradient ascent algorithms require that the estimate be projected onto the space of positive semidefinite matrices; however, doing so can lead to almost-diagonal kernels (Gillenwater et al., 2014) that cannot model negative interactions. Riemannian gradient ascent methods were considered, but deemed too computationally demanding by Mariet & Sra (2015). Furthermore, the update rule for the fixed-point approach in (Mariet & Sra, 2015) does not admit a closed-form solution for CE, rendering it impractical (App. A).
The low-rank formulation allows us to apply CE (as well as NCE, as discussed in Section 5) to learn large datasets such as the Belgian retail supermarket dataset (described in Section 5) without prohibitive learning runtimes. We show below that by leveraging the idea described in (Osogami et al., 2018), the low-rank formulation can also lead to additional speedups during prediction.
4.2 Efficient conditioning for predictions
Dynamic negatives rely upon conditioning a Dpp on a chosen sample (see Alg. 1: can be efficiently computed for all by a preprocessing step that conditions on set ). For this reason, we now describe how low-rank Dpp conditioning can be significantly sped up.
In (Gartrell et al., 2017), conditioning has a cost of , where . Since for many datasets, this represents a significant bottleneck for conditioning and computing nextitem predictions for a set. We show here that this complexity can be brought down significantly.
Proposition 1.
Given and a Dpp of rank parametrized by , where , we can derive the conditional marginal probabilities in the Dpp parametrization in only time.
Proof.
Let be the low-rank parametrization of the Dpp kernel () and . As in Gillenwater (2014), we first compute the dual kernel , where . We then compute
with , and where is the Dpp kernel conditioned on the event that all items in are observed, and is the restriction of to the rows and columns indexed by .
Computing costs . Next, following Kulesza & Taskar (2012), we eigendecompose to compute the conditional (marginal) probability of every possible item in :
where is the column vector for item in and , are an eigenvalue/eigenvector pair of .

The computational complexity for computing the eigendecomposition is , and computing for all items in costs . Therefore, we have an overall computational complexity of for computing next-item conditionals/predictions for the low-rank Dpp using the dual kernel, which is significantly superior to the typical cost of . ∎
As in most cases , this represents a substantial improvement, allowing us to condition in time essentially linear in the size of the item catalog.
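The benefit of working with the dual kernel can already be seen on the normalization constant. As an illustration, assume the notation $L = VV^\top$ with $V \in \mathbb{R}^{M \times K}$ and dual kernel $C = V^\top V \in \mathbb{R}^{K \times K}$ (these symbols are assumptions made for this example). Sylvester's determinant identity then gives:

```latex
\det(L + I_M) \;=\; \det(V V^\top + I_M) \;=\; \det(V^\top V + I_K) \;=\; \det(C + I_K).
```

Forming $C$ costs $O(MK^2)$ and its determinant $O(K^3)$, compared with $O(M^3)$ for the determinant of the $M \times M$ primal matrix. The matrices $L$ and $C$ also share the same nonzero eigenvalues, which is what makes the dual eigendecomposition usable for marginal computations.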
5 Experiments
Table 4 (a): UK dataset

                     Improvement over LowRank
Metric   LowRank     Exp                    Dyn
MPR      80.07       3.75 ± 0.16            3.74 ± 0.16
AUC      0.57297     0.41465 ± 0.01334      0.41467 ± 0.01339

Table 4 (b): Belgian dataset

                     Improvement over LowRank
Metric   LowRank     Exp                    Dyn
MPR      79.42       9.58 ± 0.15            9.64 ± 0.13
AUC      0.6162      0.3705 ± 8.569e-5      0.3702 ± 4.447e-5
We run next-item prediction and AUC-based classification experiments (footnote 4) on two recommendation datasets for Dpp evaluation: the UK retail dataset (Chen, 2012), which, after clipping all subsets to a maximum size of 100 (footnote 5), contains 4070 items and 20059 subsets, and the Belgian Retail Supermarket dataset (footnote 6), which contains 88,163 subsets over a total of 16,470 unique items (Brijs et al., 1999; Brijs, 2003).

Footnote 4: All code is implemented in Julia 0.6.1 and will be made publicly available upon publication.
Footnote 5: This allows us to use a low-rank matrix factorization for the Dpp that scales well in terms of train and prediction time.
Footnote 6: http://fimi.ua.ac.be/data/retail.pdf

We compare the following Contrastive Estimation approaches:

Exp: explicit negatives learned with CE. As to our knowledge there are no datasets with explicit negative information, we generate approximations of explicit negatives by removing one item from a positive sample and replacing it with the least likely item (Algorithm 3).

Dyn: dynamic negatives learned with CE.

As our work revolves around improving Dpp performance, we focus on the two following baselines, which are targeted to learning Dpp parametrizations from data:

NCE: Noise Contrastive Estimation using product negatives.

LowRank: the standard low-rank Dpp stochastic gradient ascent algorithm from (Gartrell et al., 2017).

NCE learns a model by contrasting with negatives drawn from a “noisy” distribution , training the model to distinguish between sets drawn from and sets drawn from . NCE has gained popularity due to its ability to model distributions with intractable normalization coefficients, and has been shown to be a powerful technique to improve submodular recommendation models (Tschiatschek et al., 2016). NCE learns by maximizing the following conditional log-likelihood:
(7) 
In our experiments, we learn the NCE objective with stochastic gradient ascent for our lowrank model, since is given by
(8) 
where if and 0 otherwise.
This allows us to approximate true explicit negatives, as we use the empirical data to derive “implausible” sets. Note, however, that when using such negatives we have no guarantee that the objective function will be well behaved, as opposed to the theoretically grounded dynamic negatives.
5.1 Experimental setup
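The performance of all methods is compared using a standard recommender-system metric: Mean Percentile Rank (MPR). MPR is a recall-based metric which evaluates the model’s predictive power by measuring how well it predicts the next item in a basket, and is a standard choice for recommender systems (Hu et al., 2008; Li et al., 2010).
Specifically, given a set $A$, let $p_{i,A} = \Pr(A \cup \{i\} \mid A)$. The percentile rank of an item $i$ given a set $A$ is defined as
$$\mathrm{PR}_{i,A} = \frac{\sum_{i' \notin A} \mathbb{1}\left(p_{i,A} \ge p_{i',A}\right)}{|\mathcal{Y} \setminus A|} \times 100\%,$$
where the sum and the normalization range over the items of the catalog $\mathcal{Y}$ that are not in $A$.
The MPR is then computed as
$$\mathrm{MPR} = \frac{1}{|\mathcal{T}|} \sum_{A \in \mathcal{T}} \mathrm{PR}_{i, A \setminus \{i\}},$$
where $\mathcal{T}$ is the set of test instances and $i$ is a randomly selected element in each set $A$. An MPR of 50 is equivalent to random selection; an MPR of 100 indicates that the model perfectly predicts the held-out item.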
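The metric can be sketched in a few lines, assuming the usual definition of the percentile rank as the percentage of candidate items (those not already in the basket) whose predicted score does not exceed that of the held-out item. The function names and the scores below are hypothetical.

```python
def percentile_rank(scores, held_out, observed):
    """Percentage of candidate items (those outside `observed`) whose score
    does not exceed the score of the held-out item."""
    candidates = [i for i in range(len(scores)) if i not in observed]
    not_better = sum(1 for i in candidates if scores[held_out] >= scores[i])
    return 100.0 * not_better / len(candidates)

def mean_percentile_rank(test_cases):
    """test_cases: iterable of (scores, held_out, observed) triples."""
    ranks = [percentile_rank(s, h, o) for s, h, o in test_cases]
    return sum(ranks) / len(ranks)

# Hypothetical next-item scores for a 5-item catalog; items 0 and 1 are already in the basket.
scores = [0.0, 0.0, 0.5, 0.3, 0.2]
pr_best = percentile_rank(scores, held_out=2, observed={0, 1})    # held-out item ranked first
pr_worst = percentile_rank(scores, held_out=4, observed={0, 1})   # held-out item ranked last
mpr = mean_percentile_rank([(scores, 2, {0, 1}), (scores, 4, {0, 1})])
print(pr_best, pr_worst, mpr)
```

With the non-strict comparison used here, the held-out item always counts itself, so the lowest attainable percentile rank is above zero for a finite catalog.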
We also evaluate the discriminative power of each model using the AUC metric. For this task, we generate a set of negative subsets uniformly at random. For each positive subset in the test set, we generate a negative subset of the same length by drawing samples uniformly at random, while ensuring that the same item is not drawn more than once for a subset. We then compute the AUC for the model on these positive and negative subsets, where the score for each subset is the log-likelihood that the model assigns to the subset. This task measures the ability of the model to discriminate between positive subsets (ground-truth subsets) and randomly generated subsets.
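The evaluation protocol above can be sketched as follows: a same-size negative is drawn without replacement for each positive, and the AUC is computed as the fraction of (positive, negative) pairs in which the positive receives the higher score, with ties counted as one half. The function names and the log-likelihood values are hypothetical placeholders for the scores a trained model would assign.

```python
import random

def random_negative(positive, n_items, rng):
    """Uniformly random subset of the same size as `positive`, with no repeated item."""
    return set(rng.sample(range(n_items), len(positive)))

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative; ties count as 1/2."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

rng = random.Random(0)
neg_subset = random_negative({3, 7, 9}, n_items=20, rng=rng)

# Hypothetical log-likelihoods assigned by a model to positive and negative subsets.
pos_loglik = [-2.0, -3.5, -1.0]
neg_loglik = [-4.0, -3.0, -6.0]
score = auc(pos_loglik, neg_loglik)
print(neg_subset, score)
```

The pairwise form is quadratic in the number of subsets; for large test sets a rank-based computation gives the same value more cheaply.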
In all experiments, 80% of subsets are used for training and the remaining 20% serve as the test set; convergence is reached when the relative change in the validation log-likelihood is below a predetermined threshold , set identically for all methods. All results are averaged over 5 learning trials.
5.2 Amazon registries
We conducted an experimental analysis on the largest 7 subdatasets included in the Amazon Registry dataset, which has become a standard dataset for Dpp modeling (Gillenwater et al., 2014; Mariet & Sra, 2015; Gartrell et al., 2017). Given the small size of these datasets (the largest has 100 items), these experiments serve only to provide insight into the general behavior of the baselines and CE methods as well as the influence of the hyperparameters on convergence.
Table 2 reports the average time to convergence for each method. As generating the dynamic negatives has a high complexity due to Dpp conditioning, Dyn is 2.7x slower than Exp. LowRank is the fastest method, as it does not need to process any negatives. NCE is by far the most time-consuming.

Table 2: Average time to convergence for each method.

Method    LowRank        Exp            Dyn            NCE
Runtime   0.83 ± 0.54    2.69 ± 0.02    7.13 ± 0.28    27.59 ± 2.20
We found that explicit and dynamic CE are not very sensitive to the and hyperparameters. For this reason, we set and in all further experiments. In previous work on low-rank Dpp learning (Gartrell et al., 2017), was found to be a reasonably optimal value, ensuring a fair comparison between all methods.
Further experiments reporting the MPR, AUC and various precisions for the Amazon registries are described in App. B.
5.3 UK and Belgian Retail Datasets
Following (Gartrell et al., 2016), for both the UK and the Belgian datasets, we set the rank of the kernel to be the size of the largest subset in the dataset (K=100 for the UK dataset, K=76 for the Belgian dataset): this optimizes memory costs while still modeling all ground-truth subsets. Based on our results on the smaller Amazon dataset, we fix and .
Finally, corroborating our timing results on the Amazon registry, we saw that one iteration of NCE required nearly 11 hours on the Belgian dataset (compared to 5 minutes for one iteration of CE). For this reason, we remove NCE as a baseline from all remaining experiments, as it is not feasible in the general case.
Tables 4 (a) and (b) summarize our results; the negative methods show significant MPR improvement over LowRank, with both Dyn and Exp performing almost 10 points higher on the Belgian dataset, and 3 points higher on the UK dataset. This is a striking improvement, compounded by small standard deviations confirming that these results are robust to matrix initialization.
We also see a dramatic improvement over LowRank in AUC, with an improvement of approximately 0.41 for the UK dataset and 0.37 for the Belgian dataset, across both Dyn and Exp methods. Both Dyn and Exp perform quite well, with an AUC score of approximately 0.9864 or higher for both models. These results suggest that for larger datasets, CE can be effective at improving the discriminative power of the Dpp.
6 Conclusion and future work
We introduce the Contrastive Estimation (CE) optimization problem, which optimizes the difference of the traditional Dpp log-likelihood and the expectation of the Dpp model’s log-likelihood under a negative distribution . This increases the Dpp’s fit to the data while simultaneously incorporating inferred or explicit domain knowledge into the learning procedure.
CE lends itself to intuitively similar but theoretically different variants, depending on the choice of : a static leads to significantly faster learning but allows spurious optima; conversely, allowing to evolve along with model parameters limits overfitting at the cost of a more complex optimization problem. Optimizing dynamic CE is in and of itself a theoretical problem worthy of independent study.
Additionally, we show that low-rank Dpp conditioning complexity can be improved by a factor of by leveraging the dual representation of the low-rank kernel. This not only improves prediction speed on a trained model, but allows for more efficient dynamic negative generation.
Experimentally, we show that CE with dynamic and explicit negatives provides comparable, significant improvements in the predictive performance of Dpps, as well as in the learned Dpp’s ability to discriminate between real and randomly generated subsets.
Our analysis also raises both theoretical and practical questions: in particular, a key component of future work lies in better understanding how explicit domain knowledge can be incorporated into the generating logic for both dynamic and static negatives. Furthermore, the CE formulation in Eq. (5) suggests the possibility of using continuous labels for weighted samples within CE.
Acknowledgements.
This work was partially supported by a Criteo Faculty Research Award, and NSF-IIS-1409802.
References
 Affandi et al. (2014) Affandi, R., Fox, E., Adams, R., and Taskar, B. Learning the parameters of Determinantal Point Process kernels. In ICML, 2014.
 Borcea & Brändén (2009a) Borcea, Julius and Brändén, Petter. The Lee-Yang and Pólya-Schur programs I: Linear operators preserving stability. Inventiones mathematicae, 177(3), 2009a.
 Borcea & Brändén (2009b) Borcea, Julius and Brändén, Petter. The Lee-Yang and Pólya-Schur programs II: Theory of stable polynomials and applications. Communications on Pure and Applied Mathematics, 62(12), 2009b.
 Borcea & Brändén (2010) Borcea, Julius and Brändén, Petter. Multivariate Pólya–Schur classification problems in the Weyl algebra. Proceedings of the London Mathematical Society, 101(1), 2010.
 Borcea et al. (2009a) Borcea, Julius, Brändén, Petter, and Liggett, Thomas. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22(2), 2009a.
 Borcea et al. (2009b) Borcea, Julius, Brändén, Petter, and Shapiro, Boris. Classification of hyperbolicity and stability preservers: the multivariate Weyl algebra case. arXiv preprint math.CA/0606360, 2009b.
 Borodin (2009) Borodin, Alexei. Determinantal Point Processes. arXiv:0911.1153, 2009.
 Bose et al. (2018) Bose, Avishek, Ling, Huan, and Cao, Yanshuai. Adversarial contrastive estimation. arXiv preprint arXiv:1805.03642, 2018.
 Bottou (1998) Bottou, Léon. Online learning in neural networks. Cambridge University Press, 1998.
 Brijs (2003) Brijs, Tom. Retail market basket data set. In Workshop on Frequent Itemset Mining Implementations (FIMI’03), 2003.
 Brijs et al. (1999) Brijs, Tom, Swinnen, Gilbert, Vanhoof, Koen, and Wets, Geert. Using association rules for product assortment decisions: A case study. In SIGKDD. ACM, 1999.
 Canévet & Fleuret (2014) Canévet, Olivier and Fleuret, Francois. Efficient Sample Mining for Object Detection. In ACML, JMLR: Workshop and Conference Proceedings, 2014.

 Chao et al. (2015) Chao, Wei-Lun, Gong, Boqing, Grauman, Kristen, and Sha, Fei. Large-margin Determinantal Point Processes. In Uncertainty in Artificial Intelligence (UAI), 2015.
 Chen (2012) Chen, D. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing and Customer Strategy Management, 19(3), August 2012.
 Chen et al. (2017) Chen, Long, Yuan, Fajie, Jose, Joemon M., and Zhang, Weinan. Improving negative sampling for word representation using self-embedded features. CoRR, abs/1710.09805, 2017.
 Decreusefond et al. (2015) Decreusefond, Laurent, Flint, Ian, Privault, Nicolas, and Torrisi, Giovanni Luca. Determinantal Point Processes, 2015.
 Djolonga et al. (2016) Djolonga, Josip, Tschiatschek, Sebastian, and Krause, Andreas. Variational inference in mixed probabilistic submodular models. In NIPS. Curran Associates, Inc., 2016.
 Dupuy & Bach (2016) Dupuy, Christophe and Bach, Francis. Learning Determinantal Point Processes in sublinear time, 2016.
 Gartrell et al. (2016) Gartrell, Mike, Paquet, Ulrich, and Koenigstein, Noam. Bayesian low-rank Determinantal Point Processes. In Sen, Shilad, Geyer, Werner, Freyne, Jill, and Castells, Pablo (eds.), RecSys. ACM, 2016.
 Gartrell et al. (2017) Gartrell, Mike, Paquet, Ulrich, and Koenigstein, Noam. Low-rank factorization of Determinantal Point Processes. In AAAI, 2017.
 Gillenwater (2014) Gillenwater, J. Approximate Inference for Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2014.
 Gillenwater et al. (2014) Gillenwater, J., Kulesza, A., Fox, E., and Taskar, B. Expectation-maximization for learning Determinantal Point Processes. In NIPS, 2014.
 Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS, 2014.
 Goodfellow (2014) Goodfellow, Ian J. On distinguishability criteria for estimating generative models, 2014.
 Gutmann & Hyvärinen (2012) Gutmann, Michael U. and Hyvärinen, Aapo. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res., 13, February 2012. ISSN 1532-4435.
 Hu et al. (2008) Hu, Yifan, Koren, Yehuda, and Volinsky, Chris. Collaborative filtering for implicit feedback datasets. In ICDM, 2008.
 Krause et al. (2008) Krause, Andreas, Singh, Ajit, and Guestrin, Carlos. Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. JMLR, 9, 2008.
 Kulesza (2013) Kulesza, A. Learning with Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2013.
 Kulesza & Taskar (2012) Kulesza, A. and Taskar, B. Determinantal Point Processes for machine learning, volume 5. Foundations and Trends in Machine Learning, 2012.
 Lavancier et al. (2015) Lavancier, Frédéric, Møller, Jesper, and Rubak, Ege. Determinantal Point Process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4), 2015.
 Li et al. (2010) Li, Yanen, Hu, Jia, Zhai, ChengXiang, and Chen, Ye. Improving one-class collaborative filtering by incorporating rich user information. In CIKM, 2010.
 Lin & Bilmes (2012) Lin, H. and Bilmes, J. Learning mixtures of submodular shells with application to document summarization. In UAI, 2012.
 Macchi (1975) Macchi, O. The coincidence approach to stochastic point processes. Adv. Appl. Prob., 7(1), 1975.
 Mariet & Sra (2015) Mariet, Zelda and Sra, Suvrit. Fixed-point algorithms for learning Determinantal Point Processes. In ICML, 2015.
 Mariet & Sra (2016a) Mariet, Zelda and Sra, Suvrit. Diversity networks. Int. Conf. on Learning Representations (ICLR), 2016a.
 Mariet & Sra (2016b) Mariet, Zelda and Sra, Suvrit. Kronecker Determinantal Point Processes. In NIPS, 2016b.
 Mikolov et al. (2013) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In NIPS, 2013.
 Mnih & Teh (2012) Mnih, Andriy and Teh, Yee Whye. A fast and simple algorithm for training neural probabilistic language models. In ICML, 2012.
 Osogami et al. (2018) Osogami, Takayuki, Raymond, Rudy, Goel, Akshay, Shirai, Tomoyuki, and Maehara, Takanori. Dynamic Determinantal Point Processes. In AAAI, 2018.
 Pemantle (2000) Pemantle, Robin. Towards a theory of negative dependence. Journal of Mathematical Physics, 41(3), 2000.
 Shrivastava et al. (2016) Shrivastava, Abhinav, Gupta, Abhinav, and Girshick, Ross. Training region-based object detectors with online hard example mining. In CVPR, 2016.
 Smith & Eisner (2005a) Smith, Noah A. and Eisner, Jason. Guiding unsupervised grammar induction using contrastive estimation. In Proc. of IJCAI Workshop on Grammatical Inference Applications, 2005a.
 Smith & Eisner (2005b) Smith, Noah A. and Eisner, Jason. Contrastive estimation: Training log-linear models on unlabeled data. In ACL, 2005b.
 Sung (1996) Sung, Kah Kay. Learning and Example Selection for Object and Pattern Detection. PhD thesis, Massachusetts Institute of Technology, 1996.
 Tschiatschek et al. (2016) Tschiatschek, Sebastian, Djolonga, Josip, and Krause, Andreas. Learning probabilistic submodular diversity models via noise contrastive estimation. In AISTATS, 2016.

 Urschel et al. (2017) Urschel, John, Brunel, Victor-Emmanuel, Moitra, Ankur, and Rigollet, Philippe. Learning Determinantal Point Processes with moments and cycles. In ICML, 2017.
 Zhang et al. (2017) Zhang, Cheng, Kjellström, Hedvig, and Mandt, Stephan. Stochastic learning on imbalanced data: Determinantal Point Processes for mini-batch diversification. CoRR, abs/1705.00607, 2017.
Appendix A Contrastive Estimation with the Picard iteration
Letting and writing as the indicator matrix such that , we have
where the convexity/concavity results follow immediately from (Mariet & Sra, 2015, Lemma 2.3). Then, the update rule requires
which cannot be evaluated due to the term.
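For reference, the sketch below shows one step of the standard, positives-only Picard iteration of Mariet & Sra (2015), which is the starting point of this appendix. It is not the contrastive variant discussed above: it maximizes only the usual log-likelihood of observed subsets. The function names, the toy kernel, and the toy subsets are illustrative assumptions and do not come from the experiments.

```python
import numpy as np

def log_likelihood(L, subsets):
    """Average DPP log-likelihood: mean_i log det(L_{A_i}) - log det(L + I)."""
    n_items = L.shape[0]
    _, logdet_norm = np.linalg.slogdet(L + np.eye(n_items))
    total = 0.0
    for A in subsets:
        _, logdet_A = np.linalg.slogdet(L[np.ix_(A, A)])
        total += logdet_A
    return total / len(subsets) - logdet_norm

def picard_step(L, subsets):
    """One Picard update L <- L + L @ Delta @ L (positives only).

    Delta is the gradient of the log-likelihood with respect to L:
    mean_i U_i inv(L_{A_i}) U_i^T - inv(L + I), where U_i selects the
    rows and columns indexed by A_i.
    """
    n_items = L.shape[0]
    delta = -np.linalg.inv(L + np.eye(n_items))
    for A in subsets:
        idx = np.ix_(A, A)
        # Scatter inv(L_A) back into the full N x N matrix.
        delta[idx] += np.linalg.inv(L[idx]) / len(subsets)
    L_new = L + L @ delta @ L
    # Symmetrize to remove floating-point asymmetry.
    return (L_new + L_new.T) / 2

# Toy data (hypothetical): 5 items, 4 observed subsets.
rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
L0 = B @ B.T + 0.1 * np.eye(5)
subsets = [[0, 1], [2, 3, 4], [0, 4], [1, 2]]

L1 = picard_step(L0, subsets)
ll_before = log_likelihood(L0, subsets)
ll_after = log_likelihood(L1, subsets)
print(ll_before, ll_after)
```

On positive samples alone, this update keeps the kernel positive definite and does not decrease the likelihood; the difficulty described in this appendix arises only once negative samples enter the objective.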
Appendix B Amazon Baby registries experiments
B.1 Amazon Baby Registries description
Registry   N     train size   test size
health     62    5278         1320
bath       100   5510         1377
apparel    100   6482         1620
bedding    100   7119         1780
diaper     100   8403         2101
gear       100   7089         1772
feeding    100   10090        2522
B.2 Experimental results
In Tab. 4(a), we compare the performance of the various algorithms with rank . The regularization strength is set to its optimal value for the LowRank algorithm, and . This allows us to compare the LR algorithm to its “augmented” negative versions without hyperparameter tuning. As Prod performs much worse than LowRank, it is not included in further experiments.
We evaluate the precision at as
                   Improvement over LowRank
Metric  LowRank    Dyn              Exp              NCE
MPR     70.50      0.92 ± 0.56      0.68 ± 0.62      0.86 ± 0.55
p@1     9.96       0.67 ± 0.75      0.58 ± 0.76      0.20 ± 1.75
p@5     25.36      1.04 ± 0.82      0.78 ± 0.67      0.67 ± 1.09
p@10    36.50      1.39 ± 0.85      1.13 ± 0.79      0.97 ± 1.18
p@20    51.22      1.38 ± 0.97      1.28 ± 1.11      1.35 ± 1.20
AUC     0.630      0.027 ± 0.017    0.026 ± 0.016    0.009 ± 0.017
With the exception of Prod, algorithms that use inferred negatives outperform traditional SGA methods across all metrics and datasets. Dyn and Exp provide consistent improvements over the other methods, whereas NCE shows higher variance and slightly worse performance. The improvements obtained with Dyn and Exp are larger than the performance loss incurred by going from full-rank to low-rank kernels, as reported by Gartrell et al. (2017).
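To make the ranking metrics in the table concrete, the sketch below computes a percentile rank and precision at k for a single held-out item. It assumes the definitions commonly used for next-item prediction (one item is removed from a test subset and all candidates are scored); the exact definitions used in these experiments may differ, and the function names and toy scores are hypothetical.

```python
import numpy as np

def percentile_rank(scores, held_out):
    """Percentage of candidates scored no higher than the held-out item.

    A value of 100 means the held-out item is ranked first; MPR is the
    mean of this quantity over all test instances.
    """
    return 100.0 * np.mean(scores[held_out] >= scores)

def precision_at_k(scores, held_out, k):
    """1.0 if the held-out item is among the k highest-scored candidates."""
    rank = 1 + np.sum(scores > scores[held_out])
    return float(rank <= k)

# Toy candidate scores (hypothetical); item 1 is the held-out item.
scores = np.array([0.1, 0.5, 0.2, 0.9])
held_out = 1

pr = percentile_rank(scores, held_out)
p_at_1 = precision_at_k(scores, held_out, k=1)
p_at_2 = precision_at_k(scores, held_out, k=2)
print(pr, p_at_1, p_at_2)
```

Averaging these per-instance values over the test set gives aggregate figures of the kind reported above, under the stated assumptions.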
Finally, we also compared all methods when tuning both the regularization and the negative to positive ratio , but did not see any significant improvements. As this suggests there is no need to do additional hyperparameter tuning when using CE, we fix for all experiments.