Learning Nonsymmetric Determinantal Point Processes

05/30/2019 ∙ by Mike Gartrell, et al. ∙ Criteo

Determinantal point processes (DPPs) have attracted substantial attention as an elegant probabilistic model that captures the balance between quality and diversity within sets. DPPs are conventionally parameterized by a positive semi-definite kernel matrix, and this symmetric kernel encodes only repulsive interactions between items. These so-called symmetric DPPs have significant expressive power, and have been successfully applied to a variety of machine learning tasks, including recommendation systems, information retrieval, and automatic summarization, among many others. Efficient algorithms for learning symmetric DPPs and sampling from these models have been reasonably well studied. However, relatively little attention has been given to nonsymmetric DPPs, which relax the symmetric constraint on the kernel. Nonsymmetric DPPs allow for both repulsive and attractive item interactions, which can significantly improve modeling power, resulting in a model that may be a better fit for some applications. We present a method that enables a tractable algorithm, based on maximum likelihood estimation, for learning nonsymmetric DPPs from data composed of observed subsets. Our method imposes a particular decomposition of the nonsymmetric kernel that enables such tractable learning algorithms, which we analyze both theoretically and experimentally. We evaluate our model on synthetic and real-world datasets, demonstrating improved predictive performance compared to symmetric DPPs, which have previously shown strong performance on modeling tasks associated with these datasets.


1 Introduction

Determinantal point processes (DPPs) have attracted growing attention from the machine learning community as an elegant probabilistic model for the relationship between items within observed subsets, drawn from a large collection of items. DPPs have been well studied for their theoretical properties Affandi et al. (2014); Borodin (2009); Decreusefond et al. (2015); Gillenwater (2014); Kulesza (2013); Kulesza and Taskar (2012); Lavancier et al. (2015), and have been applied to numerous machine learning applications, including document summarization Chao et al. (2015); Lin and Bilmes (2012), recommender systems Gartrell et al. (2016), object retrieval Affandi et al. (2014), sensor placement Krause et al. (2008), information retrieval Kulesza and Taskar (2011), and minibatch selection Zhang et al. (2017). Efficient algorithms for DPP learning Dupuy and Bach (2016); Gartrell et al. (2017); Gillenwater et al. (2014); Mariet and Sra (2015, 2016) and sampling Anari et al. (2016); Li et al. (2016); Rebeschini and Karbasi (2015) have been reasonably well studied. DPPs are conventionally parameterized by a positive semi-definite (PSD) kernel matrix, and due to this symmetric kernel, they are able to encode only repulsive interactions between items. Despite this limitation, symmetric DPPs have significant expressive power, and have proven effective in the aforementioned applications. However, the ability to encode only repulsive interactions, or negative correlations between pairs of items, does have important limitations in some settings. For example, consider the case of a recommender system for a shopping website, where the task is to provide good recommendations for items to complete a user’s shopping basket prior to checkout. For models that can only encode negative correlations, such as the symmetric DPP, it is impossible to directly encode positive interactions between items; e.g., that a purchased basket containing a video game console would be more likely to also contain a game controller. One way to resolve this limitation is to consider nonsymmetric DPPs, which relax the symmetric constraint on the kernel.

Nonsymmetric DPPs allow the model to encode both repulsive and attractive item interactions, which can significantly improve modeling power. With one notable exception Brunel (2018), little attention has been given to nonsymmetric DPPs within the machine learning community. We present a method for learning fully nonsymmetric DPP kernels from data composed of observed subsets, where we leverage a low-rank decomposition of the nonsymmetric kernel that enables a tractable learning algorithm based on maximum likelihood estimation (MLE).

Contributions

Our work makes the following contributions:

  • We present a decomposition of the nonsymmetric DPP kernel that enables a tractable MLE-based learning algorithm. To the best of our knowledge, this is the first MLE-based learning algorithm for nonsymmetric DPPs.

  • We present a general framework for the theoretical analysis of the properties of the maximum likelihood estimator for a somewhat restricted class of nonsymmetric DPPs, which shows that this estimator has particular statistical guarantees regarding consistency.

  • Through an extensive experimental evaluation on several synthetic and real-world datasets, we highlight the significant improvements in modeling power that nonsymmetric DPPs provide in comparison to symmetric DPPs. We see that nonsymmetric DPPs are more effective at recovering correlation structure within data, particularly for data that contains large disjoint collections of items.

2 Background

A DPP models a distribution over subsets of a finite ground set $\mathcal{Y}$ that is parametrized by a matrix $L \in \mathbb{R}^{M \times M}$, such that for any $Y \subseteq \mathcal{Y}$,

$\Pr(\mathbf{Y} = Y) \propto \det(L_Y), \qquad (1)$

where $L_Y$ is the submatrix of $L$ indexed by the items in $Y$.

Since the normalization constant for Eq. 1 follows from the observation that $\sum_{Y \subseteq \mathcal{Y}} \det(L_Y) = \det(L + I)$, we have, for all $Y \subseteq \mathcal{Y}$,

$\Pr(\mathbf{Y} = Y) = \dfrac{\det(L_Y)}{\det(L + I)}. \qquad (2)$

Without loss of generality, we will assume that $\mathcal{Y} = \{1, \ldots, M\}$, which we also denote by $[M]$, where $M$ is the cardinality of $\mathcal{Y}$.
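To make Eqs. 1 and 2 concrete, here is a minimal sketch (using NumPy; the helper name and the toy kernel are ours, not from the paper) that evaluates the probability of an observed subset under an L-ensemble:

```python
import numpy as np

def dpp_log_prob(L, Y):
    """Log-probability of subset Y under the L-ensemble of Eq. 2:
    log det(L_Y) - log det(L + I)."""
    idx = np.asarray(sorted(Y))
    M = L.shape[0]
    log_num = np.linalg.slogdet(L[np.ix_(idx, idx)])[1] if len(idx) > 0 else 0.0
    log_den = np.linalg.slogdet(L + np.eye(M))[1]
    return log_num - log_den

# Toy kernel: PSD symmetric part plus a skew-symmetric part (see Lemma 1 below).
rng = np.random.default_rng(0)
V, B, C = (rng.normal(size=(5, 2)) for _ in range(3))
L = V @ V.T + (B @ C.T - C @ B.T)
print(np.exp(dpp_log_prob(L, {0, 2})))
```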

It is common to assume that $L$ is a positive semi-definite matrix in order to ensure that Eq. 2 defines a probability distribution on the power set of $\mathcal{Y}$ Kulesza and Taskar (2012). More generally, any matrix $L$ whose principal minors $\det(L_Y)$, $Y \subseteq \mathcal{Y}$, are nonnegative is admissible to define a probability distribution as in (2) Brunel (2018); such matrices are called $P_0$-matrices. Recall that any matrix $L$ can be decomposed uniquely as the sum of a symmetric matrix $S$ and a skew-symmetric matrix $A$. Namely, $S = \frac{1}{2}(L + L^\top)$, whereas $A = \frac{1}{2}(L - L^\top)$. The following lemma gives a simple sufficient condition on $L$ for it to be a $P_0$-matrix.

Lemma 1.

Let $L$ be an arbitrary matrix. If its symmetric part $S$ is PSD, then $L$ is a $P_0$-matrix.

An important consequence is that a matrix of the form $D + A$, where $D$ is diagonal with positive diagonal entries and $A$ is skew-symmetric, is a $P_0$-matrix. Such a matrix would only capture nonnegative correlations, as explained in the next section.
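As a quick numerical illustration of Lemma 1 (a sketch with toy matrices of our own choosing, not from the paper), the following splits a matrix into its symmetric and skew-symmetric parts and checks that all principal minors of $D + A$ are nonnegative:

```python
import itertools
import numpy as np

def sym_skew_parts(L):
    """Unique decomposition L = S + A with S symmetric and A skew-symmetric."""
    return (L + L.T) / 2.0, (L - L.T) / 2.0

def is_P0(L, tol=1e-10):
    """True if every principal minor of L is nonnegative (L is a P0-matrix)."""
    M = L.shape[0]
    return all(np.linalg.det(L[np.ix_(idx, idx)]) >= -tol
               for k in range(1, M + 1)
               for idx in itertools.combinations(range(M), k))

rng = np.random.default_rng(1)
D = np.diag(rng.uniform(0.5, 1.5, size=4))        # positive diagonal entries
A = rng.normal(size=(4, 4)); A = (A - A.T) / 2.0  # skew-symmetric perturbation
S, _ = sym_skew_parts(D + A)
print(np.all(np.linalg.eigvalsh(S) >= -1e-10), is_P0(D + A))  # True True
```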

2.1 Capturing Positive and Negative Correlations

When DPPs are used to model real data, they are often formulated in terms of the matrix $L$ as described above, called an $L$-ensemble. However, DPPs can be alternatively represented in terms of the matrix $K$, where $K = L(L + I)^{-1} = I - (L + I)^{-1}$. Using the $K$ representation,

$\Pr(Y \subseteq \mathbf{Y}) = \det(K_Y), \qquad (3)$

where $\mathbf{Y}$ is a random subset drawn from the DPP. $K$ is called the marginal kernel; since here we are defining marginal probabilities that don’t need to sum to 1, no normalization constant is needed. DPPs are conventionally parameterized by a PSD $L$ or $K$ matrix, which is symmetric.

However, $L$ and $K$ need not be symmetric. As shown in Brunel (2018), $L$ is admissible if and only if it is a $P_0$ matrix, that is, all of its principal minors are nonnegative. The class of $P_0$ matrices is much larger, and allows us to accommodate nonsymmetric $L$ and $K$ matrices. To enforce the $P_0$ constraint on $L$ during learning, we impose the decomposition of $L$ described in Section 4. Since we see as a consequence of Lemma 1 that the sum of a PSD matrix and a skew-symmetric matrix is a $P_0$ matrix, this allows us to support nonsymmetric kernels, while ensuring that $L$ is a $P_0$ matrix. As we will see in the following, there are significant advantages to accommodating nonsymmetric kernels in terms of modeling power.

As shown in Kulesza and Taskar (2012), the eigenvalues of $K$ are bounded above by one, while $L$ need only be a PSD or $P_0$ matrix. Furthermore, $K$ gives the marginal probabilities of subsets, while $L$ directly models the atomic probabilities of observing each subset of $\mathcal{Y}$. For these reasons, most work on learning DPPs from data uses the $L$ representation of a DPP.

If $Y = \{i\}$ is a singleton set, then $\Pr(i \in \mathbf{Y}) = K_{ii}$. The diagonal entries of $K$ directly correspond to the marginal inclusion probabilities for each element of $\mathcal{Y}$. If $Y = \{i, j\}$ is a set containing two elements, then we have

$\Pr(\{i, j\} \subseteq \mathbf{Y}) = K_{ii} K_{jj} - K_{ij} K_{ji}. \qquad (4)$

Therefore, the off-diagonal elements determine the correlations between pairs of items; that is, $\Pr(\{i, j\} \subseteq \mathbf{Y}) = \Pr(i \in \mathbf{Y}) \Pr(j \in \mathbf{Y}) - K_{ij} K_{ji}$. For a symmetric $K$, the signs and magnitudes of $K_{ij}$ and $K_{ji}$ are the same, resulting in $K_{ij} K_{ji} = K_{ij}^2 \geq 0$. We see that in this case, the off-diagonal elements represent negative correlations between pairs of items, where a larger magnitude of $K_{ij}$ leads to a lower probability of $i$ and $j$ co-occurring, while a smaller magnitude of $K_{ij}$ indicates a higher co-occurrence probability. If $K_{ij} = 0$, then there is no correlation between this pair of items. Since the sign of the $-K_{ij} K_{ji}$ term is always nonpositive, the symmetric model is able to capture only nonpositive correlations between items. In fact, symmetric DPPs induce a strong negative dependence between items, called negative association Borcea et al. (2009).

For a nonsymmetric $K$, the signs and magnitudes of $K_{ij}$ and $K_{ji}$ may differ, resulting in $K_{ij} K_{ji} \leq 0$. In this case, the off-diagonal elements represent positive correlations between pairs of items, where a larger magnitude of $K_{ij} K_{ji}$ leads to a higher probability of $i$ and $j$ co-occurring, while a smaller magnitude indicates a lower co-occurrence probability. Of course, the signs of the off-diagonal elements for some pairs may be the same in a nonsymmetric $K$, which allows the model to also capture negative correlations. Therefore, a nonsymmetric $K$ can capture both negative and positive correlations between pairs of items.
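The sign argument above can be checked numerically. The snippet below (toy $2 \times 2$ kernels of our own choosing) forms $K = I - (L + I)^{-1}$ and compares the pairwise co-occurrence probability of Eq. 4 with the product of the singleton marginals:

```python
import numpy as np

def marginal_kernel(L):
    """K = L (L + I)^{-1} = I - (L + I)^{-1}."""
    return np.eye(L.shape[0]) - np.linalg.inv(L + np.eye(L.shape[0]))

L_sym = np.array([[1.0, 0.8], [0.8, 1.0]])      # symmetric: K_ij K_ji = K_ij^2 >= 0
L_nonsym = np.array([[1.0, 0.8], [-0.8, 1.0]])  # nonsymmetric: K_ij K_ji <= 0

for name, L in [("symmetric", L_sym), ("nonsymmetric", L_nonsym)]:
    K = marginal_kernel(L)
    pair = K[0, 0] * K[1, 1] - K[0, 1] * K[1, 0]  # Pr({i, j} subset of Y), Eq. 4
    indep = K[0, 0] * K[1, 1]                     # product of singleton marginals
    print(name, "correlation sign:", np.sign(pair - indep))  # -1.0 vs +1.0
```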

3 General guarantees in maximum likelihood estimation for DPPs

In this section we define the log-likelihood function and we study the Fisher information of the model. The Fisher information controls whether the maximum likelihood estimator, computed on $n$ iid samples, will be $\sqrt{n}$-consistent. When the matrix $L^*$ is not invertible (i.e., if it is only a $P_0$-matrix and not a $P$-matrix), the support of the DPP, defined as the collection of all subsets $Y \subseteq \mathcal{Y}$ such that $\Pr_{L^*}(\mathbf{Y} = Y) > 0$, depends on $L^*$, and the Fisher information will not be defined in general. Hence, we will assume, in this section, that $L^*$ is invertible and that we only maximize the log-likelihood over classes of invertible matrices $L$.

Consider a subset $\mathcal{T}$ of the set of all $P$-matrices of size $M \times M$. Given a collection $\{Y_1, \ldots, Y_n\}$ of observed subsets composed of items from $\mathcal{Y}$, our learning task is to fit a DPP kernel $L \in \mathcal{T}$ based on this data. For all $L \in \mathcal{T}$, the log-likelihood is defined as

$\hat{\Phi}(L) = \dfrac{1}{n} \sum_{i=1}^{n} \log \Pr_L(Y_i) = \sum_{Y \subseteq \mathcal{Y}} \hat{p}(Y) \log \dfrac{\det(L_Y)}{\det(L + I)}, \qquad (5)$

where $\hat{p}(Y)$ is the proportion of observed samples that equal $Y$.
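A minimal sketch of the empirical objective in Eq. 5 (helper names and the toy kernel are ours), written as an average of log-probabilities over the observed subsets:

```python
import numpy as np

def empirical_log_likelihood(L, baskets):
    """Average of log Pr_L(Y_i) over observed subsets, as in Eq. 5."""
    M = L.shape[0]
    log_norm = np.linalg.slogdet(L + np.eye(M))[1]
    total = 0.0
    for Y in baskets:
        idx = np.asarray(sorted(Y))
        total += np.linalg.slogdet(L[np.ix_(idx, idx)])[1] - log_norm
    return total / len(baskets)

rng = np.random.default_rng(2)
V, B, C = (rng.normal(size=(6, 3)) for _ in range(3))
L = V @ V.T + (B @ C.T - C @ B.T)  # nonsymmetric kernel with PSD symmetric part
print(empirical_log_likelihood(L, [{0, 1}, {2, 4, 5}, {1, 3}]))
```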

Now, assume that $Y_1, \ldots, Y_n$ are iid copies of a DPP with kernel $L^* \in \mathcal{T}$. For all $L \in \mathcal{T}$, the population log-likelihood is defined as the expectation of $\hat{\Phi}(L)$, i.e.,

$\Phi(L) = \mathbb{E}\big[\hat{\Phi}(L)\big] = \sum_{Y \subseteq \mathcal{Y}} p^*(Y) \log \dfrac{\det(L_Y)}{\det(L + I)}, \qquad (6)$

where $p^*(Y) = \Pr_{L^*}(\mathbf{Y} = Y)$.

The maximum likelihood estimator (MLE) is defined as a minimizer of $-\hat{\Phi}$ over the parameter space $\mathcal{T}$. Since any $L \in \mathcal{T}$ can be viewed as a perturbed version of $L^*$, it can be convenient to introduce the space $\mathcal{H}(\mathcal{T})$ defined as the linear subspace of $\mathbb{R}^{M \times M}$ spanned by $\{L - L' : L, L' \in \mathcal{T}\}$ and define the successive derivatives of $\hat{\Phi}$ and $\Phi$ as multilinear forms on $\mathcal{H}(\mathcal{T})$. As we will see later on, the complexity of the model can be captured in the size of the space $\mathcal{H}(\mathcal{T})$. The following lemma provides a few examples. We say that a matrix $L$ is a signed matrix if for all $i \neq j$, $L_{ji} = \epsilon_{ij} L_{ij}$ for some $\epsilon_{ij} \in \{-1, 1\}$.

Lemma 2.
  1. If $\mathcal{T}$ is the set of all positive definite matrices, it is easy to see that $\mathcal{H}(\mathcal{T})$ is the set of all symmetric matrices.

  2. If $\mathcal{T}$ is the set of all $P$-matrices, then $\mathcal{H}(\mathcal{T}) = \mathbb{R}^{M \times M}$.

  3. If $\mathcal{T}$ is the collection of signed $P$-matrices, then $\mathcal{H}(\mathcal{T}) = \mathbb{R}^{M \times M}$.

  4. If $\mathcal{T}$ is the set of $P$-matrices of the form $S + A$, where $S$ is a symmetric matrix and $A$ is a skew-symmetric matrix (i.e., $A^\top = -A$), then $\mathcal{H}(\mathcal{T}) = \mathbb{R}^{M \times M}$.

  5. If $\mathcal{T}$ is the set of signed $P$-matrices with known sign pattern (i.e., there exists $\epsilon_{ij} \in \{-1, 1\}$ such that for all $L \in \mathcal{T}$ and all $i \neq j$, $L_{ji} = \epsilon_{ij} L_{ij}$), then $\mathcal{H}(\mathcal{T})$ is the collection of all signed matrices with that same sign pattern. In particular, if $\mathcal{T}$ is the set of all $P$-matrices of the form $D + A$ where $D$ is diagonal and $A$ is skew-symmetric, then $\mathcal{H}(\mathcal{T})$ is the collection of all matrices that are the sum of a diagonal and a skew-symmetric matrix.

It is easy to see that the population log-likelihood $\Phi$ is infinitely many times differentiable on the relative interior of $\mathcal{T}$ and that for all $L$ in the relative interior of $\mathcal{T}$ and all $H \in \mathcal{H}(\mathcal{T})$,

$\mathrm{d}\Phi(L)[H] = \sum_{Y \subseteq \mathcal{Y}} p^*(Y)\, \mathrm{Tr}\big(L_Y^{-1} H_Y\big) - \mathrm{Tr}\big((L + I)^{-1} H\big) \qquad (7)$

and

$\mathrm{d}^2\Phi(L)[H, H] = -\sum_{Y \subseteq \mathcal{Y}} p^*(Y)\, \mathrm{Tr}\big((L_Y^{-1} H_Y)^2\big) + \mathrm{Tr}\big(((L + I)^{-1} H)^2\big). \qquad (8)$

Hence, we have the following theorem. The case of symmetric kernels is studied in Brunel et al. (2017) and the following result is a straightforward extension to arbitrary parameter spaces. For completeness, we include the proof in the appendix. For a set $\mathcal{T}$, we call the relative interior of $\mathcal{T}$ its interior in the linear space spanned by $\mathcal{T}$.

Theorem 1.

Let $\mathcal{T}$ be a set of $P$-matrices and let $L^*$ be in the relative interior of $\mathcal{T}$. Then, for all $H \in \mathcal{H}(\mathcal{T})$, $\mathrm{d}\Phi(L^*)[H] = 0$. Moreover, the Fisher information is the negative Hessian of $\Phi$ at $L^*$ and is given by

$I_{L^*}(H, H) = \mathrm{Var}\Big[\mathrm{Tr}\big((L^*_{\mathbf{Y}})^{-1} H_{\mathbf{Y}}\big)\Big], \quad H \in \mathcal{H}(\mathcal{T}), \qquad (9)$

where $\mathbf{Y}$ is a DPP with kernel $L^*$.

It follows that the Fisher information is positive definite if and only if any $H \in \mathcal{H}(\mathcal{T})$ that verifies

$\mathrm{Tr}\big((L^*_Y)^{-1} H_Y\big) = 0, \quad \text{for all } Y \subseteq \mathcal{Y}, \qquad (10)$

must be $H = 0$. When $\mathcal{T}$ is the space of symmetric and positive definite kernels, the Fisher information is definite if and only if $L^*$ is irreducible, i.e., it is not block-diagonal up to a permutation of its rows and columns Brunel et al. (2017). In that case, it is shown that the MLE learns $L^*$ at the speed $n^{-1/2}$. In general, this property fails and even irreducible kernels can induce a singular Fisher information.

Lemma 3.

Let $\mathcal{T}$ be a subset of $P$-matrices.

  1. Let $H \in \mathcal{H}(\mathcal{T})$ satisfy (10). Then, for all $i \in [M]$, $H_{ii} = 0$.

  2. Let $i, j \in [M]$ with $i \neq j$. Let $L^*$ and $H$ be such that $L^*_{ij} \neq 0$ and satisfy the following property: there exists $\epsilon \in \{-1, 1\}$ such that $H_{ji} = \epsilon H_{ij}$ and $L^*_{ji} \neq -\epsilon L^*_{ij}$. Then, if $H$ satisfies (10), $H_{ij} = H_{ji} = 0$.

  3. Let $L^*$ be block diagonal. Then, any $H$ supported outside of the diagonal blocks of $L^*$ satisfies (10).

In particular, this lemma implies that if $\mathcal{T}$ is a class of signed $P$-matrices with prescribed sign pattern (i.e., $L_{ji} = \epsilon_{ij} L_{ij}$ for all $L \in \mathcal{T}$ and all $i \neq j$, where the $\epsilon_{ij}$’s are $\pm 1$ and do not depend on $L$), then if $L^*$ lies in the relative interior of $\mathcal{T}$ and has no zero entries, the Fisher information is definite.

In the symmetric case, it is shown in Brunel et al. (2017) that any matrix $H$ satisfying (10) must be supported off the diagonal blocks of $L^*$, i.e., the third part of Lemma 3 is an equivalence. In the appendix, we provide a few very simple counterexamples that show that this equivalence is no longer valid in the nonsymmetric case.

4 Model

To add support for positive correlations to the DPP, we consider nonsymmetric $L$ matrices. In particular, our approach involves incorporating a skew-symmetric perturbation to the PSD symmetric part of the kernel.

Recall that any matrix $L$ can be uniquely decomposed as $L = S + A$, where $S$ is symmetric and $A$ is skew-symmetric. We impose a decomposition on $A$ as $A = B C^\top - C B^\top$, where $B$ and $C$ are low-rank $M \times K$ matrices, and we use a low-rank factorization of $S$, $S = V V^\top$, where $V$ is a low-rank $M \times K$ matrix, as described in Gartrell et al. (2017), which also allows us to enforce $S$ to be PSD and hence $L$ to be a $P_0$-matrix by Lemma 1.
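A few lines suffice to build this kernel; the sketch below (variable names are ours) assumes the decomposition $L = V V^\top + (B C^\top - C B^\top)$ described above:

```python
import numpy as np

def build_kernel(V, B, C):
    """Low-rank nonsymmetric DPP kernel L = V V^T + (B C^T - C B^T).
    The first term is the PSD symmetric part S; the second term is
    skew-symmetric, so L is a P0-matrix by Lemma 1."""
    return V @ V.T + (B @ C.T - C @ B.T)

M, K = 100, 10
rng = np.random.default_rng(3)
V, B, C = (0.1 * rng.normal(size=(M, K)) for _ in range(3))
L = build_kernel(V, B, C)
print(np.allclose((L + L.T) / 2.0, V @ V.T))  # the symmetric part is V V^T
```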

We define a regularization term, $R(V, B, C)$, as

$R(V, B, C) = -\alpha \sum_{i=1}^{M} \dfrac{1}{\mu_i} \lVert \mathbf{v}_i \rVert_2^2 \;-\; \beta \sum_{i=1}^{M} \dfrac{1}{\mu_i} \lVert \mathbf{b}_i \rVert_2^2 \;-\; \gamma \sum_{i=1}^{M} \dfrac{1}{\mu_i} \lVert \mathbf{c}_i \rVert_2^2, \qquad (11)$

where $\mu_i$ counts the number of occurrences of item $i$ in the training set, $\mathbf{v}_i$, $\mathbf{b}_i$, and $\mathbf{c}_i$ are the corresponding row vectors of $V$, $B$, and $C$, respectively, and $\alpha$, $\beta$, and $\gamma$ are tunable hyperparameters. This regularization formulation is similar to that proposed in Gartrell et al. (2017). From the above, we have the full formulation of the regularized log-likelihood of our model:

$\phi(V, B, C) = \sum_{m=1}^{n} \log \det(L_{Y_m}) \;-\; n \log \det(L + I) \;+\; R(V, B, C). \qquad (12)$

The computational complexity of evaluating Eq. 12 is dominated by computing the determinant in the second term (the normalization constant), which is $O(M^3)$. Furthermore, since $\nabla_L \log \det(L + I) = \big((L + I)^{-1}\big)^\top$, the computational complexity of computing the gradient of Eq. 12 during learning is dominated by computing the matrix inverse in the gradient of the second term, $(L + I)^{-1}$, which is $O(M^3)$. Therefore, we see that the low-rank decomposition of the kernel in our nonsymmetric model does not afford any improvement over a full-rank model in terms of computational complexity. However, our low-rank decomposition does provide a savings in terms of the memory required to store model parameters, since our low-rank model has space complexity $O(MK)$, while a full-rank version of this nonsymmetric model has space complexity $O(M^2)$. When $K \ll M$, which is typical in many settings, this will result in a significant space savings.
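As a rough illustration of how this objective can be optimized in practice, here is a schematic PyTorch sketch of MLE training for Eq. 12; the toy data, hyperparameter values, and optimizer settings are ours and purely illustrative, not the paper's experimental configuration.

```python
import torch

def neg_log_likelihood(V, B, C, baskets, counts, alpha=0.01):
    """Negative regularized log-likelihood in the spirit of Eq. 12."""
    M = V.shape[0]
    L = V @ V.T + (B @ C.T - C @ B.T)
    loss = len(baskets) * torch.logdet(L + torch.eye(M))  # normalization constant
    for Y in baskets:
        loss = loss - torch.logdet(L[Y][:, Y])            # log det(L_Y) terms
    inv_mu = 1.0 / counts.clamp(min=1.0)                  # popularity weighting as in Eq. 11
    for W in (V, B, C):
        loss = loss + alpha * (inv_mu * (W ** 2).sum(dim=1)).sum()
    return loss

M, K = 50, 5
baskets = [torch.tensor([0, 3, 7]), torch.tensor([2, 7, 11, 20])]
counts = torch.zeros(M)
for Y in baskets:
    counts[Y] += 1.0
V, B, C = ((0.1 * torch.randn(M, K)).requires_grad_() for _ in range(3))
opt = torch.optim.Adam([V, B, C], lr=0.05)
for _ in range(200):
    opt.zero_grad()
    loss = neg_log_likelihood(V, B, C, baskets, counts)
    loss.backward()
    opt.step()
```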

5 Experiments

We run extensive experiments on several synthetic and real-world datasets. Since the focus of our work is on improving DPP modeling power and comparing nonsymmetric and symmetric DPPs, we use the standard symmetric low-rank DPP as the baseline model for our experiments.

5.1 Datasets

We perform next-item prediction and AUC-based classification experiments on two real-world datasets composed of purchased shopping baskets:

  1. Amazon Baby Registries: This public dataset consists of 111,006 registries, or "baskets", of baby products, and has been used in prior work on DPP learning Gartrell et al. (2016); Gillenwater et al. (2014); Mariet and Sra (2015). The registries are collected from 15 different categories, such as "apparel", "diapers", etc., and the items in each category are disjoint. We evaluate our models on the popular apparel category.

    We also perform an evaluation on a dataset composed of the three most popular categories: apparel, diaper, and feeding. We construct this dataset, composed of three large disjoint categories of items, with a catalog of 100 items in each category, to highlight the differences in how nonsymmetric and symmetric DPPs model data. In particular, we will see that the nonsymmetric DPP uses positive correlations to capture item co-occurrences within baskets, while negative correlations are used to capture disjoint pairs of items. In contrast, since symmetric DPPs can only represent negative correlations, they must attempt to capture both co-occurring items, and items that are disjoint, using only negative correlations.

  2. UK Retail: This is a public dataset Chen (2012) that contains 25,898 baskets drawn from a catalog of 4,070 items. This dataset contains transactions from a non-store online retail company that primarily sells unique all-occasion gifts, and many customers are wholesalers. We omit all baskets with more than 100 items, which allows us to use a low-rank factorization of the symmetric DPP that scales well in training and prediction time, while also keeping memory consumption for model parameters to a manageable level.

  3. We also perform an evaluation on synthetically generated data. Our data generator allows us to explicitly control the item catalog size, the distribution of set sizes, and the item co-occurrence distribution. By controlling these parameters, we are able to empirically study how the nonsymmetric and symmetric models behave for data with a specified correlation structure.
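The following is a small sketch of the kind of generator described in item 3 (the distributional choices and parameters here are ours, for illustration only): items are partitioned into disjoint groups, each group gets a nonuniform categorical popularity distribution, and every basket is drawn from a single group.

```python
import numpy as np

def generate_baskets(n_items=100, n_groups=14, n_baskets=100, basket_size=6, seed=0):
    """Sample baskets whose items all come from one disjoint group of the catalog."""
    rng = np.random.default_rng(seed)
    groups = np.array_split(np.arange(n_items), n_groups)
    # Fixed, nonuniform (popularity-skewed) item probabilities within each group.
    popularity = [rng.dirichlet(np.ones(len(g))) for g in groups]
    baskets = []
    for _ in range(n_baskets):
        g = rng.integers(n_groups)
        size = min(basket_size, len(groups[g]))
        baskets.append(rng.choice(groups[g], size=size, replace=False, p=popularity[g]))
    return baskets

print(generate_baskets(n_items=20, n_groups=4, n_baskets=3, basket_size=4))
```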

5.2 Experimental setup and metrics

Next-item prediction involves identifying the best item to add to a subset of selected items (e.g., basket completion), and is the primary prediction task we evaluate.

We compute a next-item prediction for a basket $A$ by conditioning the DPP on the event that all items in $A$ are observed. As described in Gillenwater (2014), we compute this conditional kernel, $L^A$, as $L^A = L_{\bar{A}} - L_{\bar{A}A} L_A^{-1} L_{A\bar{A}}$, where $\bar{A} = \mathcal{Y} \setminus A$, $L_{\bar{A}}$ is the restriction of $L$ to the rows and columns indexed by $\bar{A}$, and $L_{\bar{A}A}$ consists of the $\bar{A}$ rows and $A$ columns of $L$. The computational complexity of this operation is dominated by the three matrix multiplications, which is $O(M^2 |A|)$.
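A sketch of this conditioning step (helper names are ours), using the Schur-complement form given above, with the diagonal of the conditional kernel serving as unnormalized next-item scores:

```python
import numpy as np

def condition_on(L, A):
    """Conditional kernel L^A = L_Abar - L_{Abar,A} L_A^{-1} L_{A,Abar},
    defined over the items Abar not already in the basket A."""
    M = L.shape[0]
    A = np.asarray(sorted(A))
    Abar = np.setdiff1d(np.arange(M), A)
    L_cond = (L[np.ix_(Abar, Abar)]
              - L[np.ix_(Abar, A)] @ np.linalg.solve(L[np.ix_(A, A)], L[np.ix_(A, Abar)]))
    return L_cond, Abar

rng = np.random.default_rng(4)
V, B, C = (rng.normal(size=(8, 3)) for _ in range(3))
L = V @ V.T + (B @ C.T - C @ B.T)
L_cond, remaining = condition_on(L, {1, 4})
scores = np.diag(L_cond)             # 1x1 principal minors of L^A, up to normalization
print(remaining[np.argmax(scores)])  # highest-scoring next item
```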

We compare the performance of all methods using a standard recommender system metric: mean percentile rank (MPR). An MPR of 50 is equivalent to random selection; an MPR of 100 indicates that the model perfectly predicts the held-out item. MPR is a recall-based metric which we use to evaluate the model’s predictive power by measuring how well it predicts the next item in a basket; it is a standard choice for recommender systems Hu et al. (2008); Li et al. (2010). See Appendix C for a formal description of how the MPR metric is computed.

We evaluate the discriminative power of each model using the AUC metric. For this task, we generate a set of negative subsets uniformly at random. For each positive subset in the test set, we generate a negative subset of the same length by drawing samples uniformly at random, and ensure that the same item is not drawn more than once for a subset. We compute the AUC for the model on these positive and negative subsets, where the score for each subset is the log-likelihood that the model assigns to the subset. This task measures the ability of the model to discriminate between observed positive subsets (ground-truth subsets) and randomly generated subsets.
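A sketch of this AUC evaluation (helper names are ours; scikit-learn's roc_auc_score computes the final metric): one uniformly random negative subset of matching size is drawn per positive subset, and both are scored by their log-likelihood under the model.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def subset_log_likelihood(L, Y):
    """Log-likelihood of a subset under the L-ensemble of Eq. 2."""
    idx = np.asarray(sorted(Y))
    return (np.linalg.slogdet(L[np.ix_(idx, idx)])[1]
            - np.linalg.slogdet(L + np.eye(L.shape[0]))[1])

def auc_eval(L, positives, seed=0):
    rng = np.random.default_rng(seed)
    M = L.shape[0]
    # One random negative subset per positive, of the same length, without repeats.
    negatives = [rng.choice(M, size=len(Y), replace=False) for Y in positives]
    scores = [subset_log_likelihood(L, Y) for Y in list(positives) + negatives]
    labels = [1] * len(positives) + [0] * len(negatives)
    return roc_auc_score(labels, scores)

rng = np.random.default_rng(5)
V, B, C = (rng.normal(size=(30, 4)) for _ in range(3))
L = V @ V.T + (B @ C.T - C @ B.T)
print(auc_eval(L, [[0, 1, 2], [3, 7, 9, 11]]))
```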

For all experiments, a random selection of 80% of the baskets is used for training, and the remaining 20% are used for testing. We use a small held-out validation set for tracking convergence and tuning hyperparameters. Convergence is reached during training when the relative change in validation log-likelihood is below a pre-determined threshold, which is set identically for all models. We implement our models using PyTorch, and use the Adam Kingma and Ba (2015) optimization algorithm to train them.

5.3 Results on synthetic datasets

We run a series of synthetic experiments to examine the differences between nonsymmetric and symmetric DPPs. In all of these experiments, we define an oracle that controls the generative process for the data. The oracle uses a deterministic policy to generate a dataset composed of positive baskets (items that co-occur) and negative baskets (items that don’t co-occur). This generative policy defines the expected normalized determinantal volume for each pair of items, and a threshold that limits the maximum determinantal volume for a positive basket and the minimum volume for a negative basket. This threshold is used to compute AUC results for this set of positives and negatives. Note that the negative sets are used only during evaluation. For each experiment, in Figures 1, 2, and 3, we plot a transformed version of the learned kernel matrices for the nonsymmetric and symmetric models, where each element of this matrix is re-weighted for the corresponding pair of items. For each plotted transformation of the kernel, a red element corresponds to a negative correlation, which will tend to result in the model predicting that the corresponding pair is a negative pair. Black and green elements correspond to smaller and larger positive correlations, respectively, for the nonsymmetric model, and to very small negative correlations for the symmetric model; the model will tend to predict that the corresponding pair is positive in these cases. We perform the AUC-based evaluation for each pair by comparing the determinantal volume predicted by the model with the ground-truth determinantal volume provided by the oracle; this task is equivalent to performing basket completion for the pair. In Figures 1, 2, and 3, we also show the prediction error for each pair, where green corresponds to low error, and red corresponds to high error.

Recovering positive examples for low-sparsity data

In this experiment we aim to show that the nonsymmetric model is just as capable as the symmetric model when it comes to learning negative correlations when trained on data containing few negative correlations and many positive correlations. We choose a setting where the symmetric model performs well. We construct a dataset that contains no large disjoint collections of items, with 100 baskets of size six, and a catalog of 100 items. To reduce the impact of negative correlations between items, we use a categorical distribution, with nonuniform event probabilities, for sampling the items that populate each basket, with a large coverage of possible item pairs. This logic ensures few negative correlations, since there is a low probability that two items will never co-occur. For the nonsymmetric DPP, the oracle expects the model to predict a low negative correlation, or a positive correlation, for a pair of products that have a high co-occurrence probability in the data. The results of this experiment are shown in Figure 1. We see from the plots showing the transformed kernel matrices that both the nonsymmetric and symmetric models recover approximately the same structure, resulting in similar error plots, and a similar predictive AUC of approximately 0.8 for both models.

Recovering negative examples for high-sparsity data

We construct a more challenging scenario for this experiment, which reveals an important limitation of the symmetric DPP. The symmetric DPP requires a relatively high density of observed item pairs (positive pairs) in order to learn the negative structure of the data that describes items that do not co-occur. During learning, the DPP will maximize determinantal volumes for positive pairs, while the normalization constant maintains a representation of the global volume of the parameter space for the entire item catalog. For a high density of observed positive pairs, increasing the volume allocated to positive pairs will result in a decrease in the volume assigned to many negative pairs, in order to maintain approximately the same global volume represented by the normalization constant. For a low density of positive pairs, the model will not allocate low volumes to many negative pairs. This phenomenon affects both the nonsymmetric and symmetric models. Therefore, the difference in each model’s ability to capture negative structure within a low-density region of positive pairs can be explained in terms of how each model maximizes determinantal volumes using positive and negative correlations. In the case of the symmetric DPP, the model can increase determinantal volumes by using smaller negative correlations, resulting in off-diagonal values that approach zero. As these off-diagonal parameters approach zero, this behavior has the side effect of also increasing the determinantal volumes of subsets within disjoint groups, since these volumes are also affected by these small parameter values. In contrast, the nonsymmetric model behaves differently; determinantal volumes can be maximized by switching the signs of the off-diagonal entries and increasing the magnitude of these parameters, rather than reducing the values of these parameters to near zero. This behavior allows the model to assign higher volumes to positive pairs than to negative pairs within disjoint groups in many cases, thus allowing the nonsymmetric model to recover disjoint structure.

In our experiment, the oracle controls the sparsity of the data by setting the number of disjoint groups of items; positive pairs within each disjoint group are generated uniformly at random, in order to focus on the effect of disjoint groups. For the AUC evaluation, negative baskets are constructed so that they contain items from at least two different disjoint groups. When constructing our dataset, we set the number of disjoint groups to 14, with 100 baskets of size six, and a catalog of 100 items. The results of our experiment are shown in Figure 2. We see from the error plot that the symmetric model cannot effectively learn the structure of the data, leading to high error in many areas, including within the disjoint blocks; the symmetric model provides an AUC of 0.5 as a result. In contrast, the nonsymmetric model is able to approximately recover the block structure, resulting in an AUC of 0.7.

Recovering positive examples for data that mixes disjoint sparsity with popularity-based positive structure

For our final synthetic experiment, we construct a scenario that combines aspects of our two previous experiments. In this experiment, we consider three disjoint groups. For each disjoint group we use a categorical distribution with nonuniform event probabilities for sampling items within baskets, which induces a positive correlation structure within each group. Therefore, the oracle will expect to see a high negative correlation for disjoint pairs, compared to all other non-disjoint pairs within a particular disjoint group. For items with a high co-occurrence probability, we expect the symmetric DPP to recover a near-zero negative correlation, and the nonsymmetric DPP to recover a positive correlation. Furthermore, we expect both the nonsymmetric and symmetric models to recover higher marginal probabilities, or diagonal values of the marginal kernel, for more popular items. The determinantal volumes for positive pairs containing popular items will thus tend to be larger than the volumes of negative pairs. Therefore, for baskets containing popular items, we expect that both the nonsymmetric and symmetric models will be able to easily discriminate between positive and negative baskets. When constructing positive baskets, popular items are sampled with high probability, proportional to their popularity. We therefore expect that both models will be able to recover some signal about the correlation structure of the data within each disjoint group, resulting in a predictive AUC higher than 0.5, since the popularity-based positive correlation structure within each group allows the model to recover some structure about correlations among item pairs within each group. However, we expect that the nonsymmetric model will provide better predictive performance than the symmetric model, since its properties enable recovery of disjoint structure (as discussed previously). We see the expected results in Figure 3, and they are further confirmed by the predictive AUC results, with the nonsymmetric model achieving a higher AUC than the symmetric model.

Figure 1: Results for synthetic experiment showing model recovery of structure of positive examples for low-sparsity data.
Figure 2: Results for synthetic experiment showing model recovery of structure of negative examples for high-sparsity data. 14 disjoint groups are used for data generation.
Figure 3: Results for synthetic experiment showing model recovery of positive structure for data with popularity-based positive examples and disjoint groups. Three disjoint sets and popularity-based weighted random generation are used for the positive examples.
Metric   Amazon: Apparel                Amazon: 3-category             UK Retail
         Sym DPP        Nonsym DPP      Sym DPP        Nonsym DPP      Sym DPP        Nonsym DPP
MPR      77.42 ± 1.12   80.32 ± 0.75    60.61 ± 0.94   75.09 ± 0.85    76.79 ± 0.60   79.45 ± 0.57
AUC      0.66 ± 0.01    0.73 ± 0.01     0.70 ± 0.01    0.79 ± 0.01     0.57 ± 0.001   0.65 ± 0.01

Table 1: MPR and AUC results for the Amazon apparel, Amazon three-category (Apparel + Diaper + Feeding), and UK retail datasets. Results show the mean and 95% confidence estimates obtained using bootstrapping. On every dataset and metric, the nonsymmetric DPP improves over the symmetric low-rank DPP by a margin that exceeds the confidence intervals. Hyperparameters are set per dataset, with some values shared across both Amazon datasets and some shared across all datasets.

5.4 Results on real-world datasets

Figure 4: Percentage of positive pairwise correlations encoded by the nonsymmetric DPP when trained on the three-category Amazon baby registry dataset, as a fraction of all possible pairwise correlations within and between categories.

To examine how the nonsymmetric model behaves when trained on a real-world dataset with clear disjoint structure, we first train and evaluate the model on the three-category Amazon baby registry dataset. This dataset is composed of three disjoint categories of items, where each disjoint category is composed of 100 items and approximately 10,000 baskets. Given the structure of this dataset, with a small item catalog for each category and a large number of baskets relative to the size of the catalog, we would expect a relatively high density of positive pairwise item correlations within each category. Furthermore, since each category is disjoint, we would expect the model to recover a low density of positive correlations between pairs of items that are disjoint, since these items do not co-occur within observed baskets. We see the experimental results for this three-category dataset in Figure 4. As expected, positive correlations dominate within each category; e.g., within category 1, the model encodes 80.5% of the pairwise interactions as positive correlations. For pairwise interactions between items in two disjoint categories, we see that negative correlations dominate; e.g., for one pair of disjoint categories, the model encodes 97.2% of the pairwise interactions as negative correlations (or equivalently, 2.8% as positive interactions).

Table 1 shows the results of our performance evaluation on the Amazon and UK datasets. Compared to the symmetric DPP, we see that the nonsymmetric DPP provides moderate to large improvements on both the MPR and AUC metrics for all datasets. In particular, we see a substantial improvement on the three-category Amazon dataset, providing further evidence that the nonsymmetric DPP is far more effective than the symmetric DPP at recovering the structure of data that contains large disjoint components.

6 Conclusion

By leveraging a low-rank decomposition of the nonsymmetric DPP kernel, we have introduced a tractable MLE-based algorithm for learning nonsymmetric DPPs from data. To the best of our knowledge, this is the first MLE-based learning algorithm for nonsymmetric DPPs. A general framework for the theoretical analysis of the properties of the maximum likelihood estimator for a somewhat restricted class of nonsymmetric DPPs reveals that this estimator has certain statistical guarantees regarding its consistency. While symmetric DPPs are limited to capturing only repulsive item interactions, nonsymmetric DPPs allow for both repulsive and attractive item interactions, which lead to fundamental changes in model behavior. Through an extensive experimental evaluation on several synthetic and real-world datasets, we have demonstrated that nonsymmetric DPPs can provide significant improvements in modeling power and predictive performance compared to symmetric DPPs. We believe that our contributions open the door to an array of future work on nonsymmetric DPPs, including an investigation of sampling algorithms, reductions in computational complexity for learning, and further theoretical understanding of the properties of the model.

References

  • Affandi et al. [2014] R. Affandi, E. Fox, R. Adams, and B. Taskar. Learning the parameters of determinantal point process kernels. In ICML, 2014.
  • Anari et al. [2016] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 103–115, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
  • Borcea et al. [2009] Julius Borcea, Petter Brändén, and Thomas Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22(2):521–567, 2009.
  • Borodin [2009] Alexei Borodin. Determinantal Point Processes. arXiv:0911.1153, 2009.
  • Brunel [2018] Victor-Emmanuel Brunel. Learning signed determinantal point processes through the principal minor assignment problem. In NeurIPS, pages 7365–7374, 2018.
  • Brunel et al. [2017] Victor-Emmanuel Brunel, Ankur Moitra, Philippe Rigollet, and John Urschel. Rates of estimation for determinantal point processes. In Conference on Learning Theory, pages 343–345, 2017.
  • Chao et al. [2015] Wei-Lun Chao, Boqing Gong, Kristen Grauman, and Fei Sha. Large-margin determinantal point processes. In Uncertainty in Artificial Intelligence (UAI), 2015.
  • Chen [2012] D. Chen. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing and Customer Strategy Management, 19(3), August 2012.
  • Decreusefond et al. [2015] Laurent Decreusefond, Ian Flint, Nicolas Privault, and Giovanni Luca Torrisi. Determinantal Point Processes, 2015.
  • Dupuy and Bach [2016] Christophe Dupuy and Francis Bach. Learning Determinantal Point Processes in sublinear time, 2016.
  • Gartrell et al. [2016] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Bayesian low-rank determinantal point processes. In RecSys. ACM, 2016.
  • Gartrell et al. [2017] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Low-rank factorization of Determinantal Point Processes. In AAAI, 2017.
  • Gillenwater [2014] J. Gillenwater. Approximate Inference for Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2014.
  • Gillenwater et al. [2014] J. Gillenwater, A. Kulesza, E. Fox, and B. Taskar. Expectation-maximization for learning Determinantal Point Processes. In NIPS, 2014.
  • Hu et al. [2008] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.
  • Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • Krause et al. [2008] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.
  • Kulesza [2013] A. Kulesza. Learning with Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2013.
  • Kulesza and Taskar [2011] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, 2011.
  • Kulesza and Taskar [2012] A. Kulesza and B. Taskar. Determinantal Point Processes for machine learning, volume 5. Foundations and Trends in Machine Learning, 2012.
  • Lavancier et al. [2015] Frédéric Lavancier, Jesper Møller, and Ege Rubak. Determinantal Point Process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):853–877, 2015.
  • Li et al. [2016] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast DPP sampling for Nyström with application to kernel methods. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2061–2070, New York, New York, USA, 20–22 Jun 2016. PMLR.
  • Li et al. [2010] Yanen Li, Jia Hu, ChengXiang Zhai, and Ye Chen. Improving one-class collaborative filtering by incorporating rich user information. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10, 2010.
  • Lin and Bilmes [2012] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), 2012.
  • Mariet and Sra [2015] Zelda Mariet and Suvrit Sra. Fixed-point algorithms for learning Determinantal Point Processes. In ICML, 2015.
  • Mariet and Sra [2016] Zelda Mariet and Suvrit Sra. Kronecker Determinantal Point Processes. In NIPS, 2016.
  • Rebeschini and Karbasi [2015] Patrick Rebeschini and Amin Karbasi. Fast mixing for discrete point processes. In Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1480–1500, Paris, France, 03–06 Jul 2015. PMLR.
  • Tsatsomeros [2004] Michael J. Tsatsomeros. Focus on computational neurobiology. chapter Generating and Detecting Matrices with Positive Principal Minors, pages 115–132. Nova Science Publishers, Inc., 2004.
  • Zhang et al. [2017] Cheng Zhang, Hedvig Kjellström, and Stephan Mandt. Stochastic learning on imbalanced data: Determinantal Point Processes for mini-batch diversification. CoRR, abs/1705.00607, 2017.

Appendix A Counterexamples to the backward implication in Part 3 of Lemma 3

Here, under nonsymmetric scenarios, we provide some counterexamples that show that (10) can be satisfied for nonzero matrices $H$ even though $L^*$ is not block diagonal. This implies that irreducible (i.e., not block diagonal up to relabeling of the items) matrices may be significantly harder to learn in the nonsymmetric case.

Case of signed -matrices with unknown signs

Let and define

and

Then, $H \neq 0$, it satisfies (10), and yet $L^*$ is irreducible.

For a second example, define

and

Then, $H \neq 0$, it satisfies (10), and yet $L^*$ is irreducible. Note that $L^*$ is symmetric. However, since we are under a scenario where it is not known beforehand that $L^*$ is symmetric, the space of perturbations is much larger.

Case of matrices of the form $D + A$, where $D$ is diagonal with positive entries and $A$ is skew-symmetric

Let

and

Again, we see that $H$ is nonzero, it satisfies (10), and yet $L^*$ is irreducible.

The case of matrices that are the sum of a positive diagonal matrix and a skew-symmetric matrix is particularly interesting in applications where only attractive interactions between items (i.e., nonnegative correlations) are sought. In this case, it would be interesting to be able to characterize the nullspace of the Fisher information, as in [6, Theorem 3].

Open Question.

Let $\mathcal{T}$ be the set of all $P$-matrices $D + A$, where $D$ is a positive diagonal matrix and $A$ is skew-symmetric. Recall that by Lemma 2, $\mathcal{H}(\mathcal{T})$ is the set of all matrices of the form $D' + A'$ where $D'$ is any diagonal matrix and $A'$ is any skew-symmetric matrix.

  1. Characterize the set of all $L^* \in \mathcal{T}$ such that $H = 0$ is the only solution in $\mathcal{H}(\mathcal{T})$ to (10).

  2. For a given $L^* \in \mathcal{T}$, characterize the set of all solutions $H \in \mathcal{H}(\mathcal{T})$ of (10).

Appendix B Proofs

Proof of Lemma 1

By the generating method 4.2 in [28], if the symmetric part $S$ of $L$ is positive definite, then $L$ is a $P$-matrix, hence, it is a $P_0$-matrix. Assume now that $S$ is only PSD. Let $\epsilon > 0$ and $L_\epsilon = L + \epsilon I$. Then, the symmetric part of $L_\epsilon$ is positive definite, hence, $L_\epsilon$ is a $P$-matrix, by the argument given above, so it is a $P_0$-matrix. Let $f(A) = \min_{Y \subseteq [M]} \det(A_Y)$, for $A \in \mathbb{R}^{M \times M}$. Then, $f$ is a continuous function and the set of $P_0$-matrices is the set of all matrices $A$ with $f(A) \geq 0$, hence, it is closed. Therefore, $L$ is a $P_0$-matrix, as the limit of the $P_0$-matrices $L_\epsilon$ as $\epsilon \to 0$.

Proof of Lemma 2

We only prove the third statement, since the first and the fifth ones are very simple, and the second and the fourth ones are directly implied by the third. For $i, j \in [M]$, we let $E^{(i,j)}$ be the elementary matrix with zeros everywhere but at the $(i,j)$-th entry, where we put a one.

If $i = j$, then $I$ and $I + E^{(i,i)}$ are both in $\mathcal{T}$, hence, their difference $E^{(i,i)}$ is in $\mathcal{H}(\mathcal{T})$. If $i \neq j$, note that $2I + E^{(i,j)} + E^{(j,i)}$ and $2I + E^{(i,j)} - E^{(j,i)}$ are both in $\mathcal{T}$, since they are diagonally dominant. Hence, their difference is in $\mathcal{H}(\mathcal{T})$, yielding $E^{(j,i)} \in \mathcal{H}(\mathcal{T})$. Therefore, $E^{(i,j)} \in \mathcal{H}(\mathcal{T})$ for all $i, j \in [M]$, yielding $\mathcal{H}(\mathcal{T}) = \mathbb{R}^{M \times M}$, since all matrices are linear combinations of the elementary matrices.

Proof of Theorem 1

The arguments are almost identical to the ones given in [6, Theorem 2]. The main idea is to note that for all matrices $L$,

$\det(L + I) = \sum_{Y \subseteq [M]} \det(L_Y), \qquad (13)$

which is a consequence of the multilinearity of the determinant. Then, we differentiate (13) twice on the set of $P$-matrices, by noticing that this is an open set. Indeed, recalling the notation from the proof of Lemma 1, the set of $P$-matrices is the set of all matrices $A$ such that $f(A) > 0$, and $f$ is continuous. Hence, following the same computations as in the proof of [6, Theorem 2] yields the desired result.

Proof of Lemma 3

  1. Take $Y = \{i\}$ in (10) for some $i \in [M]$. Then, $L^*_Y$ is a $1 \times 1$ matrix, whose single entry is the $i$-th diagonal entry of $L^*$. Since $L^*$ is a $P$-matrix, $L^*_{ii} \neq 0$, yielding that $H_{ii}$ must be zero.

  2. Now, take $Y = \{i, j\}$ and write $L^*_Y = \begin{pmatrix} d_1 & u \\ v & d_2 \end{pmatrix}$, where $d_1$ and $d_2$ are nonzero since $L^*$ is a $P$-matrix and $u$ is nonzero by assumption. Using the first part of the Lemma, it must hold that $H_{ii} = H_{jj} = 0$, hence, we write $H_Y = \begin{pmatrix} 0 & a \\ \epsilon a & 0 \end{pmatrix}$ for some $a \in \mathbb{R}$. A direct computation shows that if (10) is satisfied, then $a$ must be zero.

Appendix C Mean Percentile Rank

We begin our definition of MPR by defining percentile rank (PR). First, given a set $J$, let $p_{i,J} = \Pr(J \cup \{i\} \mid J)$. The percentile rank of an item $i$ given a set $J$ is defined as

$\mathrm{PR}_{i,J} = \dfrac{\sum_{i' \in \mathcal{Y} \setminus J} \mathbb{1}\big(p_{i,J} \geq p_{i',J}\big)}{|\mathcal{Y} \setminus J|} \times 100\%,$

where $\mathcal{Y} \setminus J$ indicates those elements in the ground set $\mathcal{Y}$ that are not found in $J$.

MPR is then computed as

$\mathrm{MPR} = \dfrac{1}{|T|} \sum_{J \in T} \mathrm{PR}_{i_J,\, J \setminus \{i_J\}},$

where $T$ is the set of test instances and $i_J$ is a randomly selected element in each set $J$.
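A sketch of this computation (helper names are ours; `next_item_scores` stands in for any model scorer, e.g. the conditional-kernel diagonal sketched in Section 5.2):

```python
import numpy as np

def percentile_rank(scores, held_out, candidates):
    """PR of the held-out item among all candidate items, as defined above."""
    better_or_equal = sum(scores[held_out] >= scores[c] for c in candidates)
    return 100.0 * better_or_equal / len(candidates)

def mean_percentile_rank(test_baskets, next_item_scores, n_items, seed=0):
    rng = np.random.default_rng(seed)
    prs = []
    for basket in test_baskets:
        held_out = basket[rng.integers(len(basket))]  # randomly selected element of the basket
        rest = [i for i in basket if i != held_out]
        candidates = [c for c in range(n_items) if c not in rest]
        scores = next_item_scores(rest)               # one score per item in the catalog
        prs.append(percentile_rank(scores, held_out, candidates))
    return float(np.mean(prs))

# Toy usage with a popularity-only scorer standing in for the DPP conditional.
pop = np.random.default_rng(1).random(20)
print(mean_percentile_rank([[1, 5, 9], [2, 3]], lambda rest: pop, n_items=20))
```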