Learning Determinantal Point Processes by Sampling Inferred Negatives

02/15/2018 ∙ by Zelda Mariet, et al. ∙ Criteo MIT

Determinantal Point Processes (DPPs) have attracted significant interest from the machine-learning community due to their ability to elegantly and tractably model the delicate balance between quality and diversity of sets. We consider learning DPPs from data, a key task for DPPs; for this task, we introduce a novel optimization problem, Contrastive Estimation (CE), which encodes information about "negative" samples into the basic learning model. CE is grounded in the successful use of negative information in machine-vision and language modeling. Depending on the chosen negative distribution (which may be static or evolve during optimization), CE assumes two different forms, which we analyze theoretically and experimentally. We evaluate our new model on real-world datasets; on a challenging dataset, CE learning delivers a considerable improvement in predictive performance over a DPP learned without using contrastive information.

1 Introduction

Careful selection of items from a large collection underlies many machine learning applications. Notable examples include recommender systems, information retrieval and automatic summarization methods, among others. Typically, the selected set of items must fulfill a variety of application-specific requirements; e.g., when recommending items to a user, the quality of each selected item is important. This quality must, however, be balanced by the diversity of the selected items to avoid redundancy within recommendations.

But balancing quality with diversity is challenging: as the collection size grows, the number of its subsets grows exponentially. A model that offers an elegant, tractable way to achieve this balance is a Determinantal Point Process (Dpp). Concretely, a Dpp models a distribution over subsets of a ground set $\mathcal{Y}$ that is parametrized by a semi-definite matrix $L \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{Y}|}$, such that for any $A \subseteq \mathcal{Y}$,

$$\Pr(A) \propto \det(L_A), \qquad (1)$$

where $L_A$ is the submatrix of $L$ indexed by $A$. Informally, $\det(L_A)$ represents the volume associated with subset $A$, the diagonal entry $L_{ii}$ represents the importance of item $i$, while entry $L_{ij}$ encodes similarity between items $i$ and $j$. Since the normalization constant of (1) is simply $\sum_{A \subseteq \mathcal{Y}} \det(L_A) = \det(L + I)$, we have $\Pr(A) = \det(L_A) / \det(L + I)$, which suggests why Dpps may be tractable despite their exponentially large sample space.
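As a small illustration of the normalization property, the following Python sketch evaluates $\det(L_A)/\det(L+I)$ on a toy 3-item kernel and checks that the probabilities of all subsets sum to one. The kernel values and the helper names `det` and `dpp_prob` are assumptions made for this example only; they are not taken from the experiments reported here.

```python
from itertools import combinations

def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d  # the determinant of the empty matrix is 1.0

def dpp_prob(L, A):
    """P(A) = det(L_A) / det(L + I) for a list of item indices A."""
    n = len(L)
    L_A = [[L[i][j] for j in A] for i in A]
    L_plus_I = [[L[i][j] + (i == j) for j in range(n)] for i in range(n)]
    return det(L_A) / det(L_plus_I)

# Toy kernel (hypothetical values): items 0 and 1 are similar, item 2 is unrelated.
L = [[1.0, 0.9, 0.0],
     [0.9, 1.0, 0.0],
     [0.0, 0.0, 1.0]]

subsets = [list(c) for size in range(4) for c in combinations(range(3), size)]
total = sum(dpp_prob(L, A) for A in subsets)
print(total)                                          # sums to one up to rounding
print(dpp_prob(L, [0, 1]) < dpp_prob(L, [0, 2]))      # similar pair is less likely
```

The second check reflects the diversity property: the pair of similar items spans a smaller volume, and so receives a lower probability than a pair of dissimilar items.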

The key object defining a Dpp is its kernel matrix $L$. This matrix may be fixed a priori using domain knowledge (Borodin, 2009), or, as is more common in machine learning applications, learned from observations using maximum likelihood estimation (MLE) (Gillenwater et al., 2014; Mariet & Sra, 2015). However, while fitting observed subsets well, MLE for Dpps may also assign high likelihoods to unobserved subsets far from the underlying generative distribution (Chao et al., 2015). MLE-based Dpp models may thus have modes corresponding to subsets that are close in likelihood, yet differ in how close they are to the true data distribution. Such confusable modes reduce the quality of the learned model, hurting predictions (see Figure 1).

Such concerns when learning generative models over huge sample spaces are not limited to the area of subset-selection: applications in image and text generation have been the driving force in developing techniques for generating high-quality samples. Among their innovations, a particularly successful technique uses generated samples as “negative samples” to train a discriminator, which in turn encourages generation of more realistic samples; this is the key idea behind the Generative Adversarial Nets (GANs) introduced in (Goodfellow et al., 2014).

These observations motivate us to investigate the use of Dpp-generated samples with added perturbations as negatives, which we then incorporate into the learning task to improve the modeling power of Dpps. Intuitively, negative samples are those subsets that are far from the true data distribution, but to which the Dpp erroneously assigns high probability. As there is no closed form way to generate such idealized negatives, we approximate them via an external “negative distribution”.

More precisely, we introduce a novel Dpp learning problem that incorporates samples from a negative distribution into traditional MLE. While the focus of our work is on generating the negative distribution jointly with the kernel $L$, we also investigate outside sources of negative information. Ultimately, our formulation leads to an optimization problem harder than the original Dpp learning problem; we show that even approximate solutions greatly improve the performance of the Dpp model when evaluated on concrete tasks, such as identifying the best item to add to a subset of chosen objects (basket-completion) and discriminating between held-out test data and randomly generated subsets.

Figure 1: Results for experiments on a synthetic toy dataset. Panels: (a) probabilities of making the right prediction; (b) KL divergences. This toy dataset was generated by replicating the baskets {1, 2} and {3, 4} 1000 times each. We randomly select 80% of this dataset for training, and 20% for test. We train each model to convergence, and then compute the next-item predictive probabilities for each unique pair, along with the symmetric KL divergence (only over areas of shared support) between the predictive and empirical next-item distributions. Net symmetric KL divergence is computed by adding the symmetric KL divergences for each of the two unique baskets. Experiments were run 10 times, with set to the optimal value for each model; is set to its optimal LowRank value. See Section 3 for details on the Dyn and Exp negative sampling models. We see that the LR (low-rank Dpp) model can assign relatively high predictive probabilities to modes that represent incorrect predictions, resulting in higher symmetric KL divergence compared to the true empirical distribution and a high variance, revealing a high sensitivity to initialization. The Dyn and Exp methods we introduce in Section 3 reduce this confusable mode issue, resulting in predictive distributions that are closer to the true distribution and much smaller variances.

Contributions.

To our knowledge, this work is the first theoretical or empirical investigation of augmenting the Dpp learning problem with negative information.

  • Our first main contribution is the Contrastive Estimation (CE) model, which incorporates negative information through inferred negatives into the learning task.

  • We introduce static and dynamic models for CE and discuss the theoretical and practical trade-offs of such choices. Static models leverage information that does not evolve over time, whereas dynamic models draw samples from a negative distribution that depends on the current model’s parameters; dynamic CE posits an optimization problem worthy of independent study.

  • We show how to learn CE models efficiently, and furthermore show that the complexity of conditioning a Dpp on a chosen sample can be brought from to essentially . This helps dynamic CE and removes a major bottleneck in computing next-item predictions for a set.

Using findings obtained from extensive experiments conducted on small datasets, we show on a large dataset that CE learning significantly improves the modeling power of Dpps: CE learning improves Dpp performance for next-item basket completion, as well as Dpp discriminative power, as evaluated by the model’s ability to distinguish held-out test data from randomly generated subsets.

We present a review of related work in Section 2. In Section 3, we introduce Contrastive Estimation and its dynamic and static variants. We discuss how the CE problem can be optimized efficiently in Section 4, as well as how Dpp conditioning for basket-completion predictions can be performed with improved complexity. In Section 5, we show that CE learning leads to remarkable empirical improvements of Dpp performance metrics.

2 Background and related work

First introduced to model fermion behavior by Macchi (1975), Dpps have gained popularity due to their elegant balancing of quality and subset diversity. Dpps are studied both for their theoretical properties (Kulesza & Taskar, 2012; Borodin, 2009; Affandi et al., 2014; Kulesza, 2013; Gillenwater, 2014; Decreusefond et al., 2015; Lavancier et al., 2015) and their machine learning applications: object retrieval (Affandi et al., 2014), summarization (Lin & Bilmes, 2012; Chao et al., 2015), sensor placement (Krause et al., 2008), recommender systems (Gartrell et al., 2016), neural network compression (Mariet & Sra, 2016a), and minibatch selection (Zhang et al., 2017).

Gillenwater et al. (2014) study Dpp kernel learning via EM, while Mariet & Sra (2015) present a fixed-point method. Dpp kernel learning has leveraged Kronecker (Mariet & Sra, 2016b) and low-rank (Dupuy & Bach, 2016; Gartrell et al., 2017) structures. Learning guarantees using Dpp graph properties are studied in (Urschel et al., 2017). Aside from Tschiatschek et al. (2016); Djolonga et al. (2016), who learn a Facility LocatIon Diversity (FLID) distribution (as well as more complex FLIC and FLDC models) by contrasting it with a “negative” product distribution, little attention has been given to using negative samples to learn richer subset-selection models.

Nonetheless, leveraging negative information is widely used in other applications. In object detection, negative mining corrects for the skewed simple-to-difficult negative distribution by training the model on its false positives (Sung, 1996; Canévet & Fleuret, 2014; Shrivastava et al., 2016). In language modeling, Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2012), which tasks the model with distinguishing positive samples from generated negatives, was first applied in (Mnih & Teh, 2012) and has been instrumental in Word2Vec (Mikolov et al., 2013). Since then, variants using adaptive noise (Chen et al., 2017) have been introduced. NCE is also the method used by Tschiatschek et al. (2016) for subset-selection.

An alternate approach to negative samples within submodular language models was introduced as Contrastive Estimation in (Smith & Eisner, 2005a, b). Negative sampling is also used in GANs (Goodfellow et al., 2014), where a generator network competes with a discriminative network which distinguishes between positives and generated negatives. An adversarial approach to Contrastive Estimation has been recently introduced in (Bose et al., 2018), where ideas from GANs for discrete data are used to implement an adversarial negative sampler that augments a conventional negative sampler.

3 Learning Dpps with negative samples

Motivated by the similarities between Dpp learning and crucial structured prediction problems in other ML fields, we introduce an optimization problem that leverages negative information. We refer to this problem as Contrastive Estimation (CE) due to its ties to a notion discussed in (Smith & Eisner, 2005a).

3.1 Contrastive Estimation

In conventional Dpp learning, we seek to maximize determinantal volumes of sets drawn from the true distribution $\mu$ (that we wish to model), by solving the following MLE problem, where samples in the training set $\mathcal{A}^+ = \{A_1, \ldots, A_n\}$ are assumed to be drawn i.i.d.:

$$\text{Find } L \in \operatorname*{argmax}_{L \succeq 0} \; \phi_{\mathrm{MLE}}(L) = \frac{1}{n} \sum_{i=1}^{n} \log\det(L_{A_i}) - \log\det(L + I). \qquad (2)$$

We augment problem (2) to incorporate additional information from a negative distribution $\nu$, which we wish to have the Dpp distribution move away from. The ensuing optimization problem is the main focus of our paper.

Definition 1 (Contrastive Estimation).

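Given a training set of positive samples $\mathcal{A}^+$ on which the MLE objective $\phi_{\mathrm{MLE}}$ of (2) is defined and a negative distribution $\nu$ over subsets of the ground set, we call Contrastive Estimation the problem

$$\max_{L \succeq 0} \; \phi_{\mathrm{CE}}(L) = \phi_{\mathrm{MLE}}(L) - \mathbb{E}_{A \sim \nu}\left[\log P_L(A)\right], \qquad (3)$$

where we write $P_L(A) = \det(L_A) / \det(L + I)$.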

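The expectation can be approximated by drawing a set of negative samples $\mathcal{A}^-$ from the negative distribution $\nu$; $\phi_{\mathrm{CE}}$ then becomes (with a slight abuse of notation, we continue writing $\phi_{\mathrm{CE}}$ despite the sample approximation to the expectation)

$$\phi_{\mathrm{CE}}(L) = \frac{1}{|\mathcal{A}^+|} \sum_{A \in \mathcal{A}^+} \log P_L(A) - \frac{1}{|\mathcal{A}^-|} \sum_{A \in \mathcal{A}^-} \log P_L(A). \qquad (4)$$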
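To make the sample-based objective concrete, the sketch below evaluates it in Python, assuming it takes the form "mean log-likelihood of the positives minus mean log-likelihood of the negatives" with $P_L(A) = \det(L_A)/\det(L+I)$. The two kernels, the toy baskets (modeled on the synthetic dataset of Figure 1, with 0-based item indices) and the names `log_prob` and `phi_ce` are illustrative assumptions.

```python
import math

def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def log_prob(L, A):
    """log P_L(A) = log det(L_A) - log det(L + I)."""
    n = len(L)
    L_A = [[L[i][j] for j in A] for i in A]
    L_plus_I = [[L[i][j] + (i == j) for j in range(n)] for i in range(n)]
    return math.log(det(L_A)) - math.log(det(L_plus_I))

def phi_ce(L, positives, negatives):
    """Mean positive log-likelihood minus mean negative log-likelihood."""
    pos = sum(log_prob(L, A) for A in positives) / len(positives)
    neg = sum(log_prob(L, A) for A in negatives) / len(negatives)
    return pos - neg

positives = [[0, 1], [2, 3]]   # observed baskets
negatives = [[0, 2], [1, 3]]   # baskets the model should move away from

# Kernel that cannot tell the two kinds of baskets apart.
identity = [[float(i == j) for j in range(4)] for i in range(4)]
# Kernel that marks the item pairs inside each negative as highly similar.
contrastive = [[1.0, 0.0, 0.9, 0.0],
               [0.0, 1.0, 0.0, 0.9],
               [0.9, 0.0, 1.0, 0.0],
               [0.0, 0.9, 0.0, 1.0]]

print(phi_ce(identity, positives, negatives))
print(phi_ce(contrastive, positives, negatives))
```

The second kernel scores strictly higher because it assigns a small determinantal volume to the negative baskets while leaving the positive ones untouched.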

If , the CE objective (3) reduces to . Conversely, can be viewed as a sample-based approximation of the value , where is the true distribution generating the samples in . Interestingly, another reformulation of (3) suggests an even broader class of Dpp kernel learning: indeed, let be (resp. ) for (resp. ), and define

where the should be viewed as belonging in with an additional normalization coefficient. Then, we can rewrite equation (4) in the following form

(5)

Formulation (5) suggests the use of a broader scope of continuous labels ; we do not cover this variation in the present work, but note that (5) permits the use of weighted samples for learning.

Remark 1.

Compared to the traditional Noise Contrastive Estimation (NCE) approach, which requires full knowledge of the negative distribution, CE does not suffer any such limitation: we only require an estimate of .

Remark 2.

Eq. (3) can be made to go to with pathological negative samples (i.e. ); hence, choosing the negative distribution is a crucial concern for CE. In practice, we do not observe this pathological behavior (cf. Section 5).
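The pathological behavior can be seen from a short calculation. The derivation below is a sketch that assumes the sample-based form of the objective (mean positive log-likelihood minus mean negative log-likelihood), a bounded kernel $L$, and $P_L(A) = \det(L_A)/\det(L+I)$; the two-item negative is a hypothetical example.

```latex
% Contribution of a single negative sample A^- to the objective,
% for a bounded kernel L:
\[
  -\frac{1}{|\mathcal{A}^-|} \log P_L(A^-)
  = \frac{1}{|\mathcal{A}^-|}
    \bigl( \log\det(L + I) - \log\det(L_{A^-}) \bigr)
  \;\longrightarrow\; +\infty
  \quad \text{as } \det(L_{A^-}) \to 0 .
\]
% Two-item example: for A^- = \{i, j\},
\[
  \det(L_{A^-}) = L_{ii} L_{jj} - L_{ij}^2 ,
\]
% which vanishes when L_{ij}^2 = L_{ii} L_{jj}, i.e. when the kernel
% treats items i and j as perfectly similar.
```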

Remark 3.

CE is a non-convex optimization problem, and thus admits the same guarantees as Dpp MLE learning when learned using Stochastic Gradient Ascent with decreasing step sizes; however, the convergence rate will depend on the choice of the negative distribution $\nu$.

Indeed, to fully specify the CE problem one must first choose the negative distribution $\nu$, or equivalently, choose a procedure to generate negative samples to obtain (4). We consider below two classes of distributions with considerably different ramifications: dynamic and static negatives; their analysis is the focus of the next two sections.

3.2 Dynamic negatives

In most applications leveraging negative information (e.g., negative mining, GANs), negative samples evolve over time based on the state of the learned model. We call any $\nu$ that depends on the state of the model a dynamic negative distribution: at iteration $k$ of the learning procedure with kernel estimate $L_k$, we use a $\nu_k$ parametrized by $L_k$.

More specifically, we focus on the setting where negative samples themselves are generated by the current Dpp, with the goal of reducing overfitting. Given a positive sample $A^+$, we generate a negative $A^-$ by replacing an item $i \in A^+$ with an item $j$ that yields a high probability $P_{L_k}(A^-)$ (Alg. 1). We generate the samples probabilistically rather than via mode maximization so that a sample can lead to different negatives when we generate more negatives than positives.

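  Input: Positive sample $A^+$, current kernel $L_k$
  Sample $i \in A^+$ prop. to its empirical probability in $\mathcal{A}^+$
  $A' \leftarrow A^+ \setminus \{i\}$
  Sample $j$ w.p. proportional to $P_{L_k}(A' \cup \{j\})$
  $A^- \leftarrow A' \cup \{j\}$
   return $A^-$
Algorithm 1 Generate dynamic negative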
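The procedure can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the removed item is drawn according to its training-set frequency, the replacement item is drawn with probability proportional to the determinant of the resulting set (the normalizer $\det(L+I)$ is constant across candidates and can be dropped), and the removed item itself remains a valid candidate. The kernel, the frequencies and the function name `generate_dynamic_negative` are hypothetical.

```python
import random

def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def generate_dynamic_negative(L, positive, item_freq, rng):
    """Swap one item of `positive` for an item the current kernel finds likely."""
    # Step 1: choose the item to remove, proportionally to its empirical frequency.
    weights = [item_freq[i] for i in positive]
    removed = rng.choices(positive, weights=weights, k=1)[0]
    kept = [i for i in positive if i != removed]
    # Step 2: score every candidate replacement by det(L restricted to kept + [j]).
    candidates = [j for j in range(len(L)) if j not in kept]
    scores = []
    for j in candidates:
        idx = kept + [j]
        scores.append(max(det([[L[a][b] for b in idx] for a in idx]), 0.0))
    # Step 3: sample the replacement rather than taking the arg max.
    added = rng.choices(candidates, weights=scores, k=1)[0]
    return sorted(kept + [added])

# Hypothetical 4-item kernel and uniform item frequencies.
L = [[1.0, 0.0, 0.9, 0.0],
     [0.0, 1.0, 0.0, 0.9],
     [0.9, 0.0, 1.0, 0.0],
     [0.0, 0.9, 0.0, 1.0]]
item_freq = {0: 1000, 1: 1000, 2: 1000, 3: 1000}
rng = random.Random(0)
print(generate_dynamic_negative(L, [0, 1], item_freq, rng))
```

Because the replacement is sampled, repeated calls on the same positive can return different negatives, which matters when more negatives than positives are generated.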

As $\nu$ evolves along with $L$, the second term of $\phi_{\mathrm{CE}}$ acts as a moving target that must be continuously estimated during the learning procedure. For this reason, we choose to optimize $\phi_{\mathrm{CE}}$ by a two-step procedure described in Alg. 2, similarly to an alternating maximization approach such as EM.

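  Input: Positive samples $\mathcal{A}^+$, initial kernel $L_0$, maxIter.
  $k \leftarrow 0$
  while $k <$ maxIter and not converged do
     $\mathcal{A}^-_k \leftarrow$ negatives generated from $\mathcal{A}^+$ and $L_k$ (Alg. 1)
     $L_{k+1} \leftarrow$ update of $L_k$ that increases $\phi_{\mathrm{CE}}$ given $\mathcal{A}^+$ and $\mathcal{A}^-_k$; $k \leftarrow k + 1$
  end while
   return $L_k$
Algorithm 2 Optimizing dynamic CE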

Note that this approach bears strong similarities with GANs, in which both the generator and discriminator evolve during training (dynamic negatives also appear in a discussion by Goodfellow (2014) as a theoretical tool to analyze the difference between NCE and GANs).

Once the generated negative $A^-$ has been used in an iteration of the optimization of $\phi_{\mathrm{CE}}$, it is less likely to be sampled again. (Footnote: if $A^-$ happens to be a false negative, i.e., $A^-$ appears in $\mathcal{A}^+$, then $A^-$ will be comparatively sampled more frequently as a positive, and so will contribute on average as a positive sample. Additional precautions such as the ones mentioned in (Bose et al., 2018) can also be leveraged if necessary.) Crucially, such dynamic negatives also avoid the problem alluded to in Remark 2, since by construction they have a non-zero probability under $P_{L_k}$ at iteration $k$.

3.3 Static negatives

Conversely, we can simplify the optimization problem by considering a static negative distribution: $\nu$ does not depend on the current kernel estimate. A considerable theoretical advantage of static negatives lies in the simpler optimization problem: given a static negative distribution $\nu$, the optimization objective

$$\phi_{\mathrm{CE}}(L) = \phi_{\mathrm{MLE}}(L) - \mathbb{E}_{A \sim \nu}\left[\log P_L(A)\right]$$

does not evolve during training, and is amenable to a simple invocation of stochastic gradient descent (Bottou, 1998).

Theorem 1.

Let be a static distribution over and let be such that . Let be a bounded subspace of all positive semi-definite matrices of rank . Projected stochastic gradient ascent applied to the CE objective with negative distribution and space with step sizes such that , will converge to a critical point.

Note, however, that such distributions may suffer from the fundamental theoretical issue in Rem. 2, and hence careful attention must be paid to ensure that the learning algorithm does not converge to a spurious optimum that assigns a probability to . In practice, we observed that the local nature of stochastic gradient ascent iterations was sufficient to avoid such behavior.

Let us now discuss two classical choices for a fixed negative distribution $\nu$.

Product negatives.

A common choice of negative distribution in other machine learning areas is the product distribution, which is the standard “noise” distribution used in NCE. It is defined by

(6)

where is the empirical probability of in . Although Mikolov et al. (2013) report better results by raising these probabilities to the power $3/4$, we did not observe any improvements when using exponentiated power distributions; for this reason, by product negatives, we always indicate the baseline distribution (6).

The product distribution is in practice a mismatch for Dpps, as it lacks the negative association property of Dpps which enables them to model the repulsive interactions between similar items. (Footnote: Dpps belong to the family of Strongly Rayleigh measures, which have been shown to verify a broad range of negatively associated properties; we refer the interested reader to the fascinating work (Pemantle, 2000; Borcea et al., 2009a, b; Borcea & Brändén, 2009a, b, 2010).)

Explicit negatives.

Alternatively, we may have prior knowledge of a class of subsets that our model should not generate. For example, we might know that items $i$ and $j$ are negatively correlated and hence unlikely to co-occur. We may also learn via user feedback that some generated subsets are inaccurate. We refer to negatives obtained using such outside information as explicit negatives.

A fundamental advantage of explicit negatives is that they allow us to incorporate prior knowledge and user feedback as part of the learning algorithm. The ability to incorporate such information, to our knowledge, is in itself a novel contribution to Dpp learning.

Although such knowledge may be costly and/or only available at rare intervals, a form of continuous learning that would regularly update the state of our prior knowledge (and hence the negative distribution $\nu$) would bring the explicit negative distribution into the realm of dynamic distributions, as described by Alg. 2.

4 Efficient learning and prediction

We now describe how the Contrastive Estimation problem for Dpps can be optimized efficiently. In order to efficiently generate dynamic negatives, which rely on Dpp conditioning, we additionally generalize the dual transformation leveraged in (Osogami et al., 2018) to speed up basket-completion tasks with Dpps. This speed-up impacts the broader use of Dpps, outside of CE learning.

4.1 Optimizing $\phi_{\mathrm{CE}}$

We propose to optimize the CE problem by exploiting a low-rank factorization of the kernel, writing $L = VV^\top$, where $V \in \mathbb{R}^{M \times K}$ (with $M$ the number of items) and $K$ is the rank of the kernel, which is fixed a priori.

This factorization ensures that the estimated kernel remains positive semi-definite, and enables us to leverage the low-rank computations derived in (Gartrell et al., 2017) and refined in (Osogami et al., 2018). Given the similar forms of the MLE and CE objectives, we use the traditional stochastic gradient ascent algorithm introduced by (Gartrell et al., 2017) to optimize (3). In the case of dynamic negatives, we re-generate the negative samples $\mathcal{A}^-$ after each gradient step; less frequent updates are also possible if the negative generation algorithm is very costly.

We furthermore augment $\phi_{\mathrm{CE}}$ with a regularization term $R(V)$, defined as

$$R(V) = \alpha \sum_{i=1}^{M} \frac{1}{\lambda_i} \|v_i\|_2^2,$$

where $\lambda_i$ counts the occurrences of item $i$ in the training set, $v_i$ is the corresponding row vector of $V$, and $\alpha > 0$ is a tunable hyperparameter. Note that this is the same regularization as introduced in (Gartrell et al., 2017). This regularization tempers the strength of $\|v_i\|_2$, a term interpretable as the popularity of item $i$ (Kulesza & Taskar, 2012; Gillenwater, 2014), based on its empirical popularity $\lambda_i$. Experimentally, we observe that adding $R(V)$ has a strong impact on the predictive quality of our model.

The reader may wonder if other approaches to Dpp learning are also applicable to the CE problem.

Remark 4.

Gradient ascent algorithms require that the estimate be projected onto the space of positive semi-definite matrices; however, doing so can lead to almost-diagonal kernels (Gillenwater et al., 2014) that cannot model negative interactions. Riemannian gradient ascent methods were considered, but deemed too computationally demanding by (Mariet & Sra, 2015). Furthermore, the update rule for the fixed-point approach in (Mariet & Sra, 2015) does not admit a closed form solution for CE, rendering it impractical (App. A).

The low-rank formulation allows us to apply CE (as well as NCE, as discussed in Section 5) to learn large datasets such as the Belgian retail supermarket dataset (described in Section 5) without prohibitive learning runtimes. We show below that by leveraging the idea described in (Osogami et al., 2018), the low-rank formulation can also lead to additional speed ups during prediction.

4.2 Efficient conditioning for predictions

Dynamic negatives rely upon conditioning a Dpp on a chosen sample (see Alg. 1: the probabilities used to sample the replacement item can be efficiently computed for all candidate items by a preprocessing step that conditions on the items kept from the positive sample). For this reason, we now describe how low-rank Dpp conditioning can be significantly sped up.

In (Gartrell et al., 2017), conditioning has a cost of , where . Since for many datasets, this represents a significant bottleneck for conditioning and computing next-item predictions for a set. We show here that this complexity can be brought down significantly.

Proposition 1.

Given and a Dpp of rank parametrized by , where , we can derive the conditional marginal probabilities in the Dpp parametrization in only time.

Proof.

Let be the low-rank parametrization of the Dpp kernel () and . As in Gillenwater (2014), we first compute the dual kernel , where . We then compute

with , and where is the Dpp kernel conditioned on the event that all items in are observed, and is the restriction of to the rows and columns indexed by .

Computing costs . Next, following Kulesza & Taskar (2012), we eigendecompose to compute the conditional (marginal) probability of every possible item in :

where is column vector for item in and ,

are an eigenvalue/vector of

.

The computational complexity for computing the eigendecomposition is , and computing for all items in costs . Therefore, we have an overall computational complexity of for computing next-item conditionals/predictions for the low-rank Dpp using the dual kernel, which is significantly superior to the typical cost of . ∎

As in most cases , this represents a substantial improvement, allowing us to condition in time essentially linear in the size of the item catalog.
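The reason a dual kernel helps can be illustrated with a standard identity (Sylvester's determinant identity): for a low-rank factor $V$ with $M$ rows and $K$ columns, $\det(I_M + VV^\top) = \det(I_K + V^\top V)$, so quantities that appear to require an $M \times M$ matrix can be obtained from a $K \times K$ one. The Python sketch below checks this numerically on a toy factor; it is not the full conditioning procedure of the proof above, and the matrix `V` and the helper names are assumptions made for the example.

```python
def det(m):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in m]
    n, d = len(a), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        if abs(a[p][c]) < 1e-12:
            return 0.0
        if p != c:
            a[c], a[p] = a[p], a[c]
            d = -d
        d *= a[c][c]
        for r in range(c + 1, n):
            f = a[r][c] / a[c][c]
            for k in range(c, n):
                a[r][k] -= f * a[c][k]
    return d

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def plus_identity(a):
    return [[a[i][j] + (i == j) for j in range(len(a))] for i in range(len(a))]

# Hypothetical low-rank factor: M = 4 items, rank K = 2.
V = [[1.0, 0.0],
     [0.5, 0.5],
     [0.0, 1.0],
     [1.0, 1.0]]

L = matmul(V, transpose(V))   # primal kernel, M x M
C = matmul(transpose(V), V)   # dual kernel,   K x K

primal_normalizer = det(plus_identity(L))
dual_normalizer = det(plus_identity(C))
print(primal_normalizer, dual_normalizer)   # equal up to rounding
```

When the catalog is much larger than the rank, working with the $K \times K$ matrix avoids any cubic dependence on the number of items.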

5 Experiments

(a) UK dataset
                     Improvement over LowRank
Metric   LowRank     Exp                  Dyn
MPR      80.07       3.75 ± 0.16          3.74 ± 0.16
AUC      0.57297     0.41465 ± 0.01334    0.41467 ± 0.01339

(b) Belgian dataset
                     Improvement over LowRank
Metric   LowRank     Exp                  Dyn
MPR      79.42       9.58 ± 0.15          9.64 ± 0.13
AUC      0.6162      0.3705 ± 8.569e-5    0.3702 ± 4.447e-5

Table 1: Results over the UK and Belgian datasets. Both explicit and dynamic CE obtain statistically significant improvements in MPR and AUC metrics, confirming that CE learning enhances the recommender value of the model and its ability to distinguish data drawn from the target distribution from fake samples. The impact on precision@$k$ metrics is not reported as we did not observe statistically significant deviations from LowRank performance.

We run next-item prediction and AUC-based classification experiments (all code is implemented in Julia 0.6.1 and will be made publicly available upon publication) on two recommendation datasets for Dpp evaluation: the UK retail dataset (Chen, 2012), which, after clipping all subsets to a maximum size of 100 (this allows us to use a low-rank matrix factorization for the Dpp that scales well in terms of train and prediction time), contains 4070 items and 20059 subsets, and the Belgian Retail Supermarket dataset (http://fimi.ua.ac.be/data/retail.pdf), which contains 88,163 subsets, of a total of 16,470 unique items (Brijs et al., 1999; Brijs, 2003). We compare the following Contrastive Estimation approaches:

  • Exp: explicit negatives learned with CE. As to our knowledge there are no datasets with explicit negative information, we generate approximations of explicit negatives by removing one item from a positive sample and replacing it with the least likely item (Algorithm 3).

  • Dyn: dynamic negatives learned with CE.

As our work revolves around improving Dpp performance, we focus on the two following baselines, which are targeted to learning Dpp parametrizations from data:

  • NCE: Noise Contrastive Estimation using product negatives.

  • LowRank: the standard low-rank Dpp stochastic gradient ascent algorithm from (Gartrell et al., 2017).

NCE learns a model by contrasting positive samples with negatives drawn from a “noisy” distribution, training the model to distinguish between sets drawn from the data distribution and sets drawn from the noise distribution. NCE has gained popularity due to its ability to model distributions with intractable normalization coefficients, and has been shown to be a powerful technique to improve submodular recommendation models (Tschiatschek et al., 2016). NCE learns by maximizing the following conditional log-likelihood:

(7)

In our experiments, we learn the NCE objective with stochastic gradient ascent for our low-rank model, since is given by

(8)

where if and 0 otherwise.

  input: Positive sample
  Sample w.p.
  Sample w.p. .
   return
Algorithm 3 Approximate explicit negative generation

This allows us to approximate true explicit negatives, as we use the empirical data to derive “implausible” sets. Note, however, that when using such negatives we have no guarantee that the objective function will be well behaved, as opposed to the theoretically grounded dynamic negatives.

5.1 Experimental setup

The performance of all methods is compared using a standard recommender system metric, the Mean Percentile Rank (MPR). MPR is a recall-based metric which evaluates the model’s predictive power by measuring how well it predicts the next item in a basket, and is a standard choice for recommender systems (Hu et al., 2008; Li et al., 2010).

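Specifically, given a set $A$, let $p_{i,A} = \Pr(A \cup \{i\} \mid A)$. The percentile rank of an item $i$ given a set $A$ is defined as

$$\mathrm{PR}_{i,A} = \frac{\sum_{i' \notin A} \mathbb{1}(p_{i,A} \ge p_{i',A})}{|\mathcal{Y} \setminus A|} \times 100\%.$$

The MPR is then computed as

$$\mathrm{MPR} = \frac{1}{|\mathcal{T}|} \sum_{A \in \mathcal{T}} \mathrm{PR}_{i, A \setminus \{i\}},$$

where $\mathcal{Y}$ denotes the ground set of items, $\mathcal{T}$ is the set of test instances and $i$ is a randomly selected element in each set $A$. An MPR of 50 is equivalent to random selection; an MPR of 100 indicates that the model perfectly predicts the held-out item.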
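A minimal Python sketch of the metric follows, assuming the percentile rank of the held-out item is the share of candidate items (those outside the conditioning set) whose predicted next-item probability does not exceed its own. The score dictionaries and the function names are hypothetical; in practice the scores would come from the conditioned Dpp.

```python
def percentile_rank(scores, target):
    """Percentile rank (0-100) of `target` among all candidate items.

    `scores` maps each candidate item (not in the conditioning set) to its
    predicted next-item probability.
    """
    hits = sum(1 for s in scores.values() if scores[target] >= s)
    return 100.0 * hits / len(scores)

def mean_percentile_rank(instances):
    """Average percentile rank over (scores, held_out_item) test instances."""
    return sum(percentile_rank(s, t) for s, t in instances) / len(instances)

# Hypothetical next-item scores for a basket whose remaining candidates are 2, 3, 4.
scores = {2: 0.7, 3: 0.2, 4: 0.1}
instances = [(scores, 2),   # held-out item is the top prediction
             (scores, 4)]   # held-out item is the bottom prediction
mpr = mean_percentile_rank(instances)
print(mpr)
```

Ties are counted in favor of the held-out item here, which follows the non-strict inequality in the definition.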

We also evaluate the discriminative power of each model using the AUC metric. For this task, we generate a set of negative subsets uniformly at random. For each positive subset in the test set, we generate a negative subset of the same length by drawing samples uniformly at random, while ensuring that the same item is not drawn more than once for a subset. We then compute the AUC for the model on these positive and negative subsets, where the score for each subset is the log-likelihood that the model assigns to the subset. This task measures the ability of the model to discriminate between positive subsets (ground-truth subsets) and randomly generated subsets.
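The evaluation just described can be sketched in Python as below. The AUC is computed in its pairwise form (the fraction of positive/negative pairs in which the positive receives the higher score, with ties counted as one half), and negatives are drawn uniformly without replacement to match the size of each positive. The scores are placeholders; in the experiments they would be the model log-likelihoods of each subset, and the function names are assumptions of this example.

```python
import random

def random_negative(size, n_items, rng):
    """Uniformly random subset of `size` distinct items out of `n_items`."""
    return sorted(rng.sample(range(n_items), size))

def auc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

rng = random.Random(0)
test_positives = [[0, 3, 5], [1, 2], [4, 6, 7, 8]]
test_negatives = [random_negative(len(A), 10, rng) for A in test_positives]

# Placeholder log-likelihoods; a trained model would supply these.
pos_scores = [-2.1, -1.7, -3.0]
neg_scores = [-4.2, -2.5, -6.3]
print(auc(pos_scores, neg_scores))
```

An AUC of 0.5 corresponds to a model that cannot separate held-out data from random subsets, and 1.0 to perfect separation.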

In all experiments, 80% of subsets are used for training and the remaining 20% serve as test; convergence is reached when the relative change in the validation log-likelihood is below a pre-determined threshold, set identically for all methods. All results are averaged over 5 learning trials.

5.2 Amazon registries

We conducted an experimental analysis on the largest 7 sub-datasets included in the Amazon Registry dataset, which has become a standard dataset for Dpp modeling (Gillenwater et al., 2014; Mariet & Sra, 2015; Gartrell et al., 2017). Given the small size of these datasets (the largest has 100 items), these experiments serve only to provide insight into the general behavior of the baselines and CE methods as well as the influence of the hyperparameters on convergence.

Table 2 reports the average time to convergence for each method. As generating the dynamic negatives has a high complexity due to Dpp conditioning, Dyn is 2.7x slower than Exp. LowRank is the fastest method, as it does not need to process any negatives. NCE is by far the most time-consuming.

Method     LowRank        Exp            Dyn            NCE
Runtime    0.83 ± 0.54    2.69 ± 0.02    7.13 ± 0.28    27.59 ± 2.20
Table 2: Runtime to convergence (s) on the feeding Amazon registry (, , ).

We found that explicit and dynamic CE are not very sensitive to the and hyperparameters. For this reason, we set and in all further experiments. In previous work on low-rank Dpp learning (Gartrell et al., 2017), was found to be a reasonably optimal value, ensuring a fair comparison between all methods.

Further experiments reporting the MPR, AUC and various precisions for the Amazon registries are described in App. B.

5.3 UK and Belgian Retail Datasets

Following (Gartrell et al., 2016), for both the UK and the Belgian dataset, we set the rank of the kernel to be the size of the largest subset in the dataset ($K=100$ for the UK dataset, $K=76$ for the Belgian dataset): this optimizes memory costs while still modeling all ground-truth subsets. Based on our results on the smaller Amazon dataset, we fix and .

Finally, corroborating our timing results on the Amazon registry, we saw that one iteration of NCE required nearly 11 hours on the Belgian dataset (compared to 5 minutes for one iteration of CE). For this reason, we remove NCE as a baseline from all remaining experiments, as it is not feasible in the general case.

Tables 1 (a) and (b) summarize our results; the negative methods show significant MPR improvement over LowRank, with both Dyn and Exp performing almost 10 points higher on the Belgian dataset, and 3 points higher on the UK dataset. This is a striking improvement, compounded by small standard deviations confirming that these results are robust to matrix initialization.

We also see a dramatic improvement over LowRank in AUC, with an improvement of approximately 0.41 for the UK dataset and 0.37 for the Belgian dataset, across both Dyn and Exp methods. Both Dyn and Exp perform quite well, with an AUC score of approximately 0.9864 or higher for both models. These results suggest that for larger datasets, CE can be effective at improving the discriminative power of the Dpp.

6 Conclusion and future work

We introduce the Contrastive Estimation (CE) optimization problem, which optimizes the difference of the traditional Dpp log-likelihood and the expectation of the Dpp model’s log-likelihood under a negative distribution . This increases the Dpp’s fit to the data while simultaneously incorporating inferred or explicit domain knowledge into the learning procedure.

CE lends itself to intuitively similar but theoretically different variants, depending on the choice of the negative distribution $\nu$: a static $\nu$ leads to significantly faster learning but allows spurious optima; conversely, allowing $\nu$ to evolve along with model parameters limits overfitting at the cost of a more complex optimization problem. Optimizing dynamic CE is in and of itself a theoretical problem worthy of independent study.

Additionally, we show that low-rank Dpp conditioning complexity can be improved by a factor of by leveraging the dual representation of the low-rank kernel. This not only improves prediction speed on a trained model, but allows for more efficient dynamic negative generation.
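The dual representation can be illustrated on the normalization constant. The sketch below assumes L = V Vᵀ with V of size N×K and checks that the N×N determinant det(L + I) equals the K×K determinant det(VᵀV + I), which is Sylvester's determinant identity. It illustrates the general idea only and does not reproduce the conditioning algorithm or its complexity analysis.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 200, 10                       # toy sizes (assumption): N items, rank K
V = rng.standard_normal((N, K))

# Primal: determinant of an N x N matrix, cubic in N.
_, logdet_primal = np.linalg.slogdet(V @ V.T + np.eye(N))

# Dual: determinant of a K x K matrix, after forming V^T V in O(N K^2).
_, logdet_dual = np.linalg.slogdet(V.T @ V + np.eye(K))
```

Since K is much smaller than N in the low-rank setting, working with the K×K dual kernel is what makes repeated operations, such as generating dynamic negatives, cheaper.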

Experimentally, we show that CE with dynamic negatives and CE with explicit negatives provide comparable, significant improvements in the predictive performance of Dpps, as well as in the learned Dpp’s ability to discriminate between real and randomly generated subsets.

Our analysis also raises both theoretical and practical questions: in particular, a key component of future work lies in better understanding how explicit domain knowledge can be incorporated into the generating logic for both dynamic and static negatives. Furthermore, the CE formulation in Eq. (5) suggests the possibility of using continuous labels for weighted samples within CE.

Acknowledgements.

This work was partially supported by a Criteo Faculty Research Award, and NSF-IIS-1409802.

References

  • Affandi et al. (2014) Affandi, R., Fox, E., Adams, R., and Taskar, B. Learning the parameters of Determinantal Point Process kernels. In ICML, 2014.
  • Borcea & Brändén (2009a) Borcea, Julius and Brändén, Petter. The Lee-Yang and Pólya-Schur programs I Linear operators preserving stability. Inventiones mathematicae, 177(3), 2009a.
  • Borcea & Brändén (2009b) Borcea, Julius and Brändén, Petter. The Lee-Yang and Pólya-Schur programs II Theory of stable polynomials and applications. Communications on Pure and Applied Mathematics, 62(12), 2009b.
  • Borcea & Brändén (2010) Borcea, Julius and Brändén, Petter. Multivariate Pólya–Schur classification problems in the Weyl algebra. Proceedings of the London Mathematical Society, 101(1), 2010.
  • Borcea et al. (2009a) Borcea, Julius, Brändén, Petter, and Liggett, Thomas. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22(2), 2009a.
  • Borcea et al. (2009b) Borcea, Julius, Brändén, Petter, and Shapiro, Boris. Classification of hyperbolicity and stability preservers: the multivariate Weyl algebra case. arXiv preprint math.CA/0606360, 2009b.
  • Borodin (2009) Borodin, Alexei. Determinantal Point Processes. arXiv:0911.1153, 2009.
  • Bose et al. (2018) Bose, Avishek, Ling, Huan, and Cao, Yanshuai. Adversarial contrastive estimation. arXiv preprint arXiv:1805.03642, 2018.
  • Bottou (1998) Bottou, Léon. On-line learning in neural networks. Cambridge University Press, 1998.
  • Brijs (2003) Brijs, Tom. Retail market basket data set. In Workshop on Frequent Itemset Mining Implementations (FIMI’03), 2003.
  • Brijs et al. (1999) Brijs, Tom, Swinnen, Gilbert, Vanhoof, Koen, and Wets, Geert. Using association rules for product assortment decisions: A case study. In SIGKDD. ACM, 1999.
  • Canévet & Fleuret (2014) Canévet, Olivier and Fleuret, Francois. Efficient Sample Mining for Object Detection. In ACML, JMLR: Workshop and Conference Proceedings, 2014.
  • Chao et al. (2015) Chao, Wei-Lun, Gong, Boqing, Grauman, Kristen, and Sha, Fei. Large-margin Determinantal Point Processes. In Uncertainty in Artificial Intelligence (UAI), 2015.
  • Chen (2012) Chen, D. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing and Customer Strategy Management, 19(3), August 2012.
  • Chen et al. (2017) Chen, Long, Yuan, Fajie, Jose, Joemon M., and Zhang, Weinan. Improving negative sampling for word representation using self-embedded features. CoRR, abs/1710.09805, 2017.
  • Decreusefond et al. (2015) Decreusefond, Laurent, Flint, Ian, Privault, Nicolas, and Torrisi, Giovanni Luca. Determinantal Point Processes, 2015.
  • Djolonga et al. (2016) Djolonga, Josip, Tschiatschek, Sebastian, and Krause, Andreas. Variational inference in mixed probabilistic submodular models. In NIPS. Curran Associates, Inc., 2016.
  • Dupuy & Bach (2016) Dupuy, Christophe and Bach, Francis. Learning Determinantal Point Processes in sublinear time, 2016.
  • Gartrell et al. (2016) Gartrell, Mike, Paquet, Ulrich, and Koenigstein, Noam. Bayesian low-rank Determinantal Point Processes. In Sen, Shilad, Geyer, Werner, Freyne, Jill, and Castells, Pablo (eds.), RecSys. ACM, 2016.
  • Gartrell et al. (2017) Gartrell, Mike, Paquet, Ulrich, and Koenigstein, Noam. Low-rank factorization of Determinantal Point Processes. In AAAI, 2017.
  • Gillenwater (2014) Gillenwater, J. Approximate Inference for Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2014.
  • Gillenwater et al. (2014) Gillenwater, J., Kulesza, A., Fox, E., and Taskar, B. Expectation-maximization for learning Determinantal Point Processes. In NIPS, 2014.
  • Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS. 2014.
  • Goodfellow (2014) Goodfellow, Ian J. On distinguishability criteria for estimating generative models, 2014.
  • Gutmann & Hyvärinen (2012) Gutmann, Michael U. and Hyvärinen, Aapo. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res., 13, February 2012. ISSN 1532-4435.
  • Hu et al. (2008) Hu, Yifan, Koren, Yehuda, and Volinsky, Chris. Collaborative filtering for implicit feedback datasets. In ICDM, 2008.
  • Krause et al. (2008) Krause, Andreas, Singh, Ajit, and Guestrin, Carlos. Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. JMLR, 9, 2008.
  • Kulesza (2013) Kulesza, A. Learning with Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2013.
  • Kulesza & Taskar (2012) Kulesza, A. and Taskar, B. Determinantal Point Processes for machine learning, volume 5. Foundations and Trends in Machine Learning, 2012.
  • Lavancier et al. (2015) Lavancier, Frédéric, Møller, Jesper, and Rubak, Ege. Determinantal Point Process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4), 2015.
  • Li et al. (2010) Li, Yanen, Hu, Jia, Zhai, ChengXiang, and Chen, Ye. Improving one-class collaborative filtering by incorporating rich user information. In CIKM, 2010.
  • Lin & Bilmes (2012) Lin, H. and Bilmes, J. Learning mixtures of submodular shells with application to document summarization. In UAI, 2012.
  • Macchi (1975) Macchi, O. The coincidence approach to stochastic point processes. Adv. Appl. Prob., 7(1), 1975.
  • Mariet & Sra (2015) Mariet, Zelda and Sra, Suvrit. Fixed-point algorithms for learning Determinantal Point Processes. In ICML, 2015.
  • Mariet & Sra (2016a) Mariet, Zelda and Sra, Suvrit. Diversity networks. Int. Conf. on Learning Representations (ICLR), 2016a.
  • Mariet & Sra (2016b) Mariet, Zelda and Sra, Suvrit. Kronecker Determinantal Point Processes. In NIPS, 2016b.
  • Mikolov et al. (2013) Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S, and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In NIPS. 2013.
  • Mnih & Teh (2012) Mnih, Andriy and Teh, Yee Whye. A fast and simple algorithm for training neural probabilistic language models. In ICML, 2012.
  • Osogami et al. (2018) Osogami, Takayuki, Raymond, Rudy, Goel, Akshay, Shirai, Tomoyuki, and Maehara, Takanori. Dynamic Determinantal Point Processes. In AAAI, 2018.
  • Pemantle (2000) Pemantle, Robin. Towards a theory of negative dependence. Journal of Mathematical Physics, 41(3), 2000.
  • Shrivastava et al. (2016) Shrivastava, Abhinav, Gupta, Abhinav, and Girshick, Ross. Training region-based object detectors with online hard example mining. In CVPR, 2016.
  • Smith & Eisner (2005a) Smith, Noah A. and Eisner, Jason. Guiding unsupervised grammar induction using contrastive estimation. In Proc. of IJCAI Workshop on Grammatical Inference Applications, 2005a.
  • Smith & Eisner (2005b) Smith, Noah A. and Eisner, Jason. Contrastive estimation: Training log-linear models on unlabeled data. In ACL, ACL ’05, 2005b.
  • Sung (1996) Sung, Kah Kay. Learning and Example Selection for Object and Pattern Detection. PhD thesis, Massachusetts Institute of Technology, 1996.
  • Tschiatschek et al. (2016) Tschiatschek, Sebastian, Djolonga, Josip, and Krause, Andreas. Learning probabilistic submodular diversity models via noise contrastive estimation. In AISTATS, 2016.
  • Urschel et al. (2017) Urschel, John, Brunel, Victor-Emmanuel, Moitra, Ankur, and Rigollet, Philippe. Learning Determinantal Point Processes with moments and cycles. In ICML, 2017.
  • Zhang et al. (2017) Zhang, Cheng, Kjellström, Hedvig, and Mandt, Stephan. Stochastic learning on imbalanced data: Determinantal Point Processes for mini-batch diversification. CoRR, abs/1705.00607, 2017.

Appendix A Contrastive Estimation with the Picard iteration

Letting and writing as the indicator matrix such that , we have

where the convexity/concavity results follow immediately from (Mariet & Sra, 2015, Lemma 2.3). Then, the update rule requires

which cannot be evaluated due to the term.

Appendix B Amazon Baby registries experiments

B.1 Amazon Baby Registries description


Registry   # items   train size   test size
health     62        5278         1320
bath       100       5510         1377
apparel    100       6482         1620
bedding    100       7119         1780
diaper     100       8403         2101
gear       100       7089         1772
feeding    100       10090        2522

Table 3: Description of the Amazon Baby registries dataset.

B.2 Experimental results

In Tab. 4(a), we compare the performance of the various algorithms with rank . The regularization strength is set to its optimal value for the LowRank algorithm, and . This allows us to compare the LowRank algorithm to its “augmented” negative versions without hyper-parameter tuning. As Prod performs much worse than LowRank, it is not included in further experiments.

We evaluate the precision at k as

                     Improvement over LowRank
Metric   LowRank   Dyn              Exp              NCE
MPR      70.50     0.92 ± 0.56      0.68 ± 0.62      0.86 ± 0.55
p@1      9.96      0.67 ± 0.75      0.58 ± 0.76      0.20 ± 1.75
p@5      25.36     1.04 ± 0.82      0.78 ± 0.67      0.67 ± 1.09
p@10     36.50     1.39 ± 0.85      1.13 ± 0.79      0.97 ± 1.18
p@20     51.22     1.38 ± 0.97      1.28 ± 1.11      1.35 ± 1.20
AUC      0.630     0.027 ± 0.017    0.026 ± 0.016    0.009 ± 0.017
Table 4: MPR, p@k, and AUC values for LowRank, and baseline improvement over LowRank for other methods. Positive values indicate the algorithm performs better than LowRank, and bold values indicate improvement over LowRank that lies outside the standard deviation. Experiments were run 5 times, with ; is set to its optimal LowRank value.
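For readers unfamiliar with the ranking metrics in Table 4, the sketch below shows one common way to compute them. It assumes a next-item prediction setup in which the model scores every candidate item and one held-out item per test instance is the target; the definitions, the tie handling, and the function names are assumptions for illustration and may differ in detail from the ones used in the experiments.

```python
import numpy as np

def percentile_rank(scores, target):
    """Percentage of candidates scored no higher than the held-out target."""
    scores = np.asarray(scores, dtype=float)
    return 100.0 * np.mean(scores <= scores[target])

def rank_of(scores, target):
    """1-based rank of the target when candidates are sorted by descending score."""
    scores = np.asarray(scores, dtype=float)
    return 1 + int(np.sum(scores > scores[target]))

def mpr(all_scores, targets):
    """Mean percentile rank over test instances (100 is a perfect ranking)."""
    return float(np.mean([percentile_rank(s, t)
                          for s, t in zip(all_scores, targets)]))

def precision_at_k(all_scores, targets, k):
    """Percentage of test instances whose target is ranked in the top k."""
    hits = [rank_of(s, t) <= k for s, t in zip(all_scores, targets)]
    return 100.0 * float(np.mean(hits))

# Toy example: two test instances sharing the same three candidate scores.
scores = [0.1, 0.5, 0.4]
toy_mpr = mpr([scores, scores], [1, 2])
toy_p1 = precision_at_k([scores, scores], [1, 2], k=1)
```

In this toy example the first target is ranked first and the second target is ranked second, so half of the instances count as hits at k = 1.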

Compared to traditional SGA methods, algorithms that use inferred negatives perform better across all metrics and datasets, with the exception of Prod. Dyn and Exp provide consistent improvements compared to the other methods, whereas NCE shows higher variance and slightly worse performance. The improvements observed with Dyn and Exp are larger than the loss in performance incurred by going from full-rank to low-rank kernels reported in (Gartrell et al., 2017).

Finally, we also compared all methods when tuning both the regularization and the negative-to-positive ratio , but did not see any significant improvements. As this suggests that no additional hyper-parameter tuning is needed when using CE, we fix for all experiments.