1 Introduction
Determinantal point processes (DPPs) have attracted growing attention from the machine learning community as an elegant probabilistic model for the relationship between items within observed subsets, drawn from a large collection of items. DPPs have been well studied for their theoretical properties Affandi et al. (2014); Borodin (2009); Decreusefond et al. (2015); Gillenwater (2014); Kulesza (2013); Kulesza and Taskar (2012); Lavancier et al. (2015), and have been applied to numerous machine learning applications, including document summarization Chao et al. (2015); Lin and Bilmes (2012), recommender systems Gartrell et al. (2016), object retrieval Affandi et al. (2014), sensor placement Krause et al. (2008), information retrieval Kulesza and Taskar (2011), and minibatch selection Zhang et al. (2017). Efficient algorithms for DPP learning Dupuy and Bach (2016); Gartrell et al. (2017); Gillenwater et al. (2014); Mariet and Sra (2015, 2016) and sampling Anari et al. (2016); Li et al. (2016); Rebeschini and Karbasi (2015) have been reasonably well studied. DPPs are conventionally parameterized by a positive semidefinite (PSD) kernel matrix, and due to this symmetric kernel, they are able to encode only repulsive interactions between items. Despite this limitation, symmetric DPPs have significant expressive power, and have proven effective in the aforementioned applications. However, the ability to encode only repulsive interactions, or negative correlations between pairs of items, does have important limitations in some settings. For example, consider the case of a recommender system for a shopping website, where the task is to provide good recommendations for items to complete a user's shopping basket prior to checkout.
For models that can only encode negative correlations, such as the symmetric DPP, it is impossible to directly encode positive interactions between items; e.g., a purchased basket containing a video game console would be more likely to also contain a game controller. One way to resolve this limitation is to consider nonsymmetric DPPs, which relax the symmetric constraint on the kernel.
Nonsymmetric DPPs allow the model to encode both repulsive and attractive item interactions, which can significantly improve modeling power. With one notable exception Brunel (2018), little attention has been given to nonsymmetric DPPs within the machine learning community. We present a method for learning fully nonsymmetric DPP kernels from data composed of observed subsets, where we leverage a low-rank decomposition of the nonsymmetric kernel that enables a tractable learning algorithm based on maximum likelihood estimation (MLE).
Contributions
Our work makes the following contributions:

We present a decomposition of the nonsymmetric DPP kernel that enables a tractable MLE-based learning algorithm. To the best of our knowledge, this is the first MLE-based learning algorithm for nonsymmetric DPPs.

We present a general framework for the theoretical analysis of the maximum likelihood estimator for a somewhat restricted class of nonsymmetric DPPs, which establishes consistency guarantees for this estimator.

Through an extensive experimental evaluation on several synthetic and real-world datasets, we highlight the significant improvements in modeling power that nonsymmetric DPPs provide in comparison to symmetric DPPs. We see that nonsymmetric DPPs are more effective at recovering correlation structure within data, particularly for data that contains large disjoint collections of items.
2 Background
A DPP models a distribution over subsets of a finite ground set $\mathcal{Y}$ that is parametrized by a matrix $L \in \mathbb{R}^{M \times M}$, such that for any $Y \subseteq \mathcal{Y}$,

$$P(Y) \propto \det(L_Y), \qquad (1)$$

where $L_Y$ is the submatrix of $L$ indexed by $Y$.
Since the normalization constant for Eq. 1 follows from the observation that $\sum_{Y \subseteq \mathcal{Y}} \det(L_Y) = \det(L + I)$, we have, for all $Y \subseteq \mathcal{Y}$,

$$P(Y) = \frac{\det(L_Y)}{\det(L + I)}, \qquad (2)$$

where $I$ is the $M \times M$ identity matrix.
Without loss of generality, we will assume that $\mathcal{Y} = \{1, 2, \ldots, M\}$, which we also denote by $[M]$, where $M$ is the cardinality of $\mathcal{Y}$.
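As a quick numerical sketch (ours, not part of the paper; the helper name `dpp_log_prob` is our own), Eq. 2 can be evaluated directly for a small kernel, and the normalization identity can be verified by summing over all subsets:

```python
import numpy as np
from itertools import chain, combinations

def dpp_log_prob(L, Y):
    """log P(Y) = log det(L_Y) - log det(L + I) for a subset Y of item indices."""
    idx = list(Y)
    # slogdet of the empty 0x0 submatrix is 0, matching det(L_emptyset) = 1.
    _, logdet_Y = np.linalg.slogdet(L[np.ix_(idx, idx)])
    _, logdet_Z = np.linalg.slogdet(L + np.eye(L.shape[0]))
    return logdet_Y - logdet_Z

# Sanity check: probabilities of all 2^M subsets sum to one, which verifies
# the normalization identity sum_Y det(L_Y) = det(L + I).
rng = np.random.default_rng(0)
M = 4
G = rng.normal(size=(M, M))
L = G @ G.T  # a symmetric PSD kernel, for simplicity of this sketch
all_subsets = chain.from_iterable(combinations(range(M), r) for r in range(M + 1))
total = sum(np.exp(dpp_log_prob(L, Y)) for Y in all_subsets)
```

The same code applies unchanged to a nonsymmetric admissible kernel, since Eq. 2 is purely determinantal.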
It is common to assume that $L$ is a positive semidefinite matrix in order to ensure that $P$ defines a probability distribution on the power set of $\mathcal{Y}$ Kulesza and Taskar (2012). More generally, any matrix $L$ whose principal minors $\det(L_Y)$, $Y \subseteq \mathcal{Y}$, are nonnegative is admissible to define a probability distribution as in (2) Brunel (2018); such matrices are called $P_0$ matrices. Recall that any matrix $L$ can be decomposed uniquely as the sum of a symmetric matrix $S$ and a skew-symmetric matrix $A$: namely, $L = S + A$, where $S = (L + L^\top)/2$, whereas $A = (L - L^\top)/2$. The following lemma gives a simple sufficient condition on $S$ for $L$ to be a $P_0$ matrix.

Lemma 1.
Let $L$ be an arbitrary matrix. If its symmetric part $S$ is PSD, then $L$ is a $P_0$ matrix.
An important consequence is that a matrix of the form $L = D + A$, where $D$ is diagonal with positive diagonal entries and $A$ is skew-symmetric, is a $P_0$ matrix. Such a matrix would only capture nonnegative correlations, as explained in the next section.
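Lemma 1 can be sanity-checked numerically on small examples. The brute-force sketch below (our own illustration; `is_P0` is a hypothetical helper) verifies that a kernel with positive diagonal plus a skew-symmetric part has all principal minors nonnegative:

```python
import numpy as np
from itertools import chain, combinations

def is_P0(L, tol=1e-9):
    """Brute-force test that every principal minor of L is nonnegative (P0)."""
    n = L.shape[0]
    subsets = chain.from_iterable(combinations(range(n), r) for r in range(1, n + 1))
    return all(np.linalg.det(L[np.ix_(list(s), list(s))]) >= -tol for s in subsets)

rng = np.random.default_rng(1)
n = 4
D = np.diag(rng.uniform(0.5, 2.0, size=n))  # positive diagonal: PSD symmetric part
G = rng.normal(size=(n, n))
A = G - G.T                                  # skew-symmetric part
L = D + A                                    # P0 by Lemma 1
```

By contrast, a symmetric matrix whose symmetric part is not PSD, e.g. `[[1, 2], [2, 1]]`, has a negative principal minor and fails the check.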
2.1 Capturing Positive and Negative Correlations
When DPPs are used to model real data, they are often formulated in terms of the matrix $L$ as described above, called an $L$-ensemble. However, DPPs can be alternatively represented in terms of the matrix $K$, where $K = I - (L + I)^{-1}$. Using the $K$ representation,

$$P(A \subseteq Y) = \det(K_A), \qquad (3)$$

where $Y$ is a random subset drawn from the DPP. $K$ is called the marginal kernel; since here we are defining marginal probabilities that need not sum to 1, no normalization constant is needed. DPPs are conventionally parameterized by a PSD $L$ or $K$ matrix, which is symmetric.
However, $L$ and $K$ need not be symmetric. As shown in Brunel (2018), $L$ is admissible if and only if $L$ is a $P_0$ matrix, that is, all of its principal minors are nonnegative. The class of $P_0$ matrices is much larger than the class of symmetric PSD matrices, and allows us to accommodate nonsymmetric $L$ and $K$ matrices. To enforce the $P_0$ constraint on $L$ during learning, we impose the decomposition of $L$ described in Section 4. Since we see as a consequence of Lemma 1 that the sum of a PSD matrix and a skew-symmetric matrix is a $P_0$ matrix, this decomposition allows us to support nonsymmetric kernels, while ensuring that $L$ is a $P_0$ matrix. As we will see in the following, there are significant advantages to accommodating nonsymmetric kernels in terms of modeling power.
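The identity $K = I - (L + I)^{-1}$ is purely algebraic and carries over to nonsymmetric kernels. A small sketch (ours, not from the paper) checks that the diagonal of $K$ reproduces brute-force inclusion probabilities for a nonsymmetric $P_0$ kernel built as in Lemma 1:

```python
import numpy as np
from itertools import chain, combinations

def marginal_kernel(L):
    """Marginal kernel K = I - (L + I)^{-1}."""
    n = L.shape[0]
    return np.eye(n) - np.linalg.inv(L + np.eye(n))

rng = np.random.default_rng(2)
n = 4
D = np.diag(rng.uniform(0.5, 2.0, size=n))
G = rng.normal(size=(n, n))
L = D + (G - G.T) / 2        # nonsymmetric P0 kernel (Lemma 1)
K = marginal_kernel(L)

# Brute-force P(i in Y): sum det(L_Y)/det(L + I) over all subsets containing i.
Z = np.linalg.det(L + np.eye(n))
def inclusion_prob(i):
    subsets = chain.from_iterable(combinations(range(n), r) for r in range(n + 1))
    return sum(np.linalg.det(L[np.ix_(list(Y), list(Y))])
               for Y in subsets if i in Y) / Z
```

The diagonal entries of `K` match `inclusion_prob(i)` for every item, even though `K` is not symmetric.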
As shown in Kulesza and Taskar (2012), the eigenvalues of $K$ are bounded above by one, while $L$ need only be a PSD or $P_0$ matrix. Furthermore, $K$ gives the marginal probabilities of subsets, while $L$ directly models the atomic probabilities of observing each subset of $\mathcal{Y}$. For these reasons, most work on learning DPPs from data uses the $L$ representation of a DPP. If $A = \{i\}$ is a singleton set, then $P(i \in Y) = K_{ii}$. The diagonal entries of $K$ therefore directly correspond to the marginal inclusion probabilities for each element of $\mathcal{Y}$. If $A = \{i, j\}$ is a set containing two elements, then we have

$$P(\{i, j\} \subseteq Y) = \det\!\begin{pmatrix} K_{ii} & K_{ij} \\ K_{ji} & K_{jj} \end{pmatrix} = K_{ii} K_{jj} - K_{ij} K_{ji}. \qquad (4)$$

Therefore, the off-diagonal elements determine the correlations between pairs of items; that is, $P(\{i, j\} \subseteq Y) = P(i \in Y)\,P(j \in Y) - K_{ij} K_{ji}$. For a symmetric $K$, the signs and magnitudes of $K_{ij}$ and $K_{ji}$ are the same, resulting in $K_{ij} K_{ji} = K_{ij}^2 \geq 0$. We see that in this case, the off-diagonal elements represent negative correlations between pairs of items, where a larger value of $K_{ij} K_{ji}$ leads to a lower probability of $i$ and $j$ co-occurring, while a smaller value indicates a higher co-occurrence probability. If $K_{ij} K_{ji} = 0$, then there is no correlation between this pair of items. Since the sign of the $-K_{ij} K_{ji}$ term is always nonpositive, the symmetric model is able to capture only nonpositive correlations between items. In fact, symmetric DPPs induce a strong negative dependence between items, called negative association Borcea et al. (2009).
For a nonsymmetric $K$, the signs and magnitudes of $K_{ij}$ and $K_{ji}$ may differ, resulting in $K_{ij} K_{ji} < 0$. In this case, the off-diagonal elements represent positive correlations between pairs of items, where a larger value of $|K_{ij} K_{ji}|$ leads to a higher probability of $i$ and $j$ co-occurring, while a smaller value indicates a lower co-occurrence probability. Of course, the signs of the off-diagonal elements for some pairs may be the same in a nonsymmetric $K$, which allows the model to also capture negative correlations. Therefore, a nonsymmetric $K$ can capture both negative and positive correlations between pairs of items.
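A hand-picked two-item example (ours; the marginal kernels below are assumed admissible) makes the sign argument concrete: flipping the sign of one off-diagonal entry of $K$ turns a negative pairwise correlation into a positive one:

```python
import numpy as np

def pair_cooccurrence(K):
    """P({i, j} ⊆ Y) = K_ii K_jj - K_ij K_ji for a 2x2 marginal kernel."""
    return K[0, 0] * K[1, 1] - K[0, 1] * K[1, 0]

indep = 0.5 * 0.5  # co-occurrence probability if the two items were independent

K_sym = np.array([[0.5, 0.3],
                  [0.3, 0.5]])     # symmetric: K_12 * K_21 = 0.09 > 0
K_nonsym = np.array([[0.5, 0.3],
                     [-0.3, 0.5]]) # nonsymmetric: K_12 * K_21 = -0.09 < 0
```

With the symmetric kernel the pair co-occurs less often than under independence (negative correlation); with the nonsymmetric kernel it co-occurs more often (positive correlation), exactly as the sign of $K_{ij}K_{ji}$ predicts.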
3 General guarantees in maximum likelihood estimation for DPPs
In this section we define the log-likelihood function and study the Fisher information of the model. The Fisher information controls whether the maximum likelihood estimator, computed on iid samples, will be consistent. When the matrix $L$ is not invertible (i.e., if it is only a $P_0$ matrix and not a $P$ matrix, whose principal minors are all positive), the support of the DPP, defined as the collection of all subsets $J \subseteq [M]$ such that $P(J) \neq 0$, depends on $L$, and the Fisher information will not be defined in general. Hence, we will assume, in this section, that $L$ is invertible and that we only maximize the log-likelihood over classes of invertible matrices.
Consider a subset $\mathcal{L}$ of the set of all invertible $M \times M$ matrices. Given a collection of $n$ observed subsets $A_1, \ldots, A_n$ composed of items from $[M]$, our learning task is to fit a DPP kernel based on this data. For all $L \in \mathcal{L}$, the log-likelihood is defined as

$$\hat{\Phi}(L) = \frac{1}{n} \sum_{i=1}^{n} \log P_L(A_i) = \sum_{J \subseteq [M]} \hat{p}(J) \log P_L(J), \qquad (5)$$

where $\hat{p}(J)$ is the proportion of observed samples that equal $J$.
Now, assume that $A_1, \ldots, A_n$ are iid copies of $Y$, a DPP with kernel $L^* \in \mathcal{L}$. For all $L \in \mathcal{L}$, the population log-likelihood is defined as the expectation of $\hat{\Phi}(L)$, i.e.,

$$\Phi(L) = \mathbb{E}\big[\hat{\Phi}(L)\big] = \sum_{J \subseteq [M]} p^*(J) \log P_L(J), \qquad (6)$$

where $p^*(J) = P_{L^*}(Y = J)$.
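A direct implementation of the empirical log-likelihood in Eq. 5 is only a few lines; the sketch below (ours, with our own function name) uses log-determinants for numerical stability:

```python
import numpy as np

def log_likelihood(L, observed):
    """Mean log-likelihood (1/n) * sum_i log P_L(A_i) of observed subsets."""
    n_items = L.shape[0]
    _, logdet_Z = np.linalg.slogdet(L + np.eye(n_items))
    total = 0.0
    for A in observed:
        idx = list(A)
        _, logdet_A = np.linalg.slogdet(L[np.ix_(idx, idx)])
        total += logdet_A - logdet_Z
    return total / len(observed)

# With L = I, every subset of an M-item catalog has probability 2^{-M},
# so the mean log-likelihood is -M log 2 regardless of the observed data.
M = 3
ll = log_likelihood(np.eye(M), [[0], [1, 2], []])
```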
The maximum likelihood estimator (MLE) is defined as a maximizer of $\hat{\Phi}$ over the parameter space $\mathcal{L}$. Since $L \in \mathcal{L}$ can be viewed as a perturbed version of $L^*$, it can be convenient to introduce the space $T$ defined as the linear subspace of $\mathbb{R}^{M \times M}$ spanned by $\mathcal{L}$, and to define the successive derivatives of $\Phi$ and $\hat{\Phi}$ as multilinear forms on $T$. As we will see later on, the complexity of the model can be captured in the size of the space $T$. The following lemma provides a few examples. We say that a matrix $L$ is a signed matrix if for all $i \neq j$, $L_{ij} = \pm L_{ji}$.
Lemma 2.

If $\mathcal{L}$ is the set of all positive definite matrices, it is easy to see that $T$ is the set of all symmetric matrices.

If $\mathcal{L}$ is the set of all invertible matrices, then $T = \mathbb{R}^{M \times M}$.

If $\mathcal{L}$ is the collection of signed matrices, then $T = \mathbb{R}^{M \times M}$.

If $\mathcal{L}$ is the set of matrices of the form $L = S + A$, where $S$ is a symmetric matrix and $A$ is a skew-symmetric matrix (i.e., $A^\top = -A$), then $T = \mathbb{R}^{M \times M}$.

If $\mathcal{L}$ is the set of signed matrices with known sign pattern (i.e., there exist $\epsilon_{ij} \in \{-1, +1\}$ such that for all $L \in \mathcal{L}$ and all $i \neq j$, $L_{ij} = \epsilon_{ij} L_{ji}$), then $T$ is the collection of all signed matrices with that same sign pattern. In particular, if $\mathcal{L}$ is the set of all matrices of the form $D + A$, where $D$ is diagonal and $A$ is skew-symmetric, then $T$ is the collection of all matrices that are the sum of a diagonal and a skew-symmetric matrix.
It is easy to see that the population log-likelihood $\Phi$ is infinitely many times differentiable on the relative interior of $\mathcal{L}$ and that for all $L$ in the relative interior of $\mathcal{L}$ and all $H \in T$,

$$\mathrm{d}\Phi(L)[H] = \sum_{J \subseteq [M]} p^*(J) \left( \operatorname{tr}\!\big(L_J^{-1} H_J\big) - \operatorname{tr}\!\big((L + I)^{-1} H\big) \right) \qquad (7)$$

and

$$\mathrm{d}^2\Phi(L)[H, H] = \sum_{J \subseteq [M]} p^*(J) \left( \operatorname{tr}\!\big((L + I)^{-1} H (L + I)^{-1} H\big) - \operatorname{tr}\!\big(L_J^{-1} H_J L_J^{-1} H_J\big) \right). \qquad (8)$$
Hence, we have the following theorem. The case of symmetric kernels is studied in Brunel et al. (2017), and the following result is a straightforward extension to arbitrary parameter spaces. For completeness, we include the proof in the appendix. For a set $\mathcal{L}$, we call the relative interior of $\mathcal{L}$ its interior in the linear space spanned by $\mathcal{L}$.
Theorem 1.
Let $\mathcal{L}$ be a set of invertible matrices and let $L^*$ be in the relative interior of $\mathcal{L}$. Then, for all $L \in \mathcal{L}$, $\Phi(L) \leq \Phi(L^*)$. Moreover, the Fisher information $I(L^*)$ is the negative Hessian of $\Phi$ at $L^*$ and is given by

$$I(L^*)[H, H] = \operatorname{Var}\!\left( \operatorname{tr}\!\big((L^*_Y)^{-1} H_Y\big) \right), \quad H \in T, \qquad (9)$$

where $Y$ is a DPP with kernel $L^*$.
It follows that the Fisher information is positive definite if and only if any $H \in T$ that verifies

$$\operatorname{tr}\!\big((L^*_J)^{-1} H_J\big) = \operatorname{tr}\!\big((L^* + I)^{-1} H\big), \quad \text{for all } J \subseteq [M], \qquad (10)$$

must be $H = 0$. When $\mathcal{L}$ is the space of symmetric and positive definite kernels, the Fisher information is definite if and only if $L^*$ is irreducible, i.e., it is not block-diagonal up to a permutation of its rows and columns Brunel et al. (2017). In that case, it is shown that the MLE learns $L^*$ at the speed $n^{-1/2}$. In general, this property fails, and even irreducible kernels can induce a singular Fisher information.
Lemma 3.
In particular, this lemma implies that if $\mathcal{L}$ is a class of signed matrices with prescribed sign pattern (i.e., $L_{ij} = \epsilon_{ij} L_{ji}$ for all $L \in \mathcal{L}$ and all $i \neq j$, where the $\epsilon_{ij}$'s are $\pm 1$ and do not depend on $L$), then if $L^*$ lies in the relative interior of $\mathcal{L}$ and has no zero entries, the Fisher information is definite.
In the symmetric case, it is shown in Brunel et al. (2017) that the only matrices $H$ satisfying (10) must be supported off the diagonal blocks of $L^*$, i.e., the third part of Lemma 3 is an equivalence. In the appendix, we provide a few very simple counterexamples showing that this equivalence is no longer valid in the nonsymmetric case.
4 Model
To add support for positive correlations to the DPP, we consider nonsymmetric $L$ matrices. In particular, our approach involves incorporating a skew-symmetric perturbation into the PSD kernel.
Recall that any matrix $L$ can be uniquely decomposed as $L = S + A$, where $S$ is symmetric and $A$ is skew-symmetric. We impose a decomposition on $A$ as $A = B C^\top - C B^\top$, where $B$ and $C$ are low-rank $M \times D$ matrices, and we use a low-rank factorization of $S$, $S = V V^\top$, where $V$ is a low-rank $M \times K$ matrix, as described in Gartrell et al. (2017). This factorization also allows us to enforce $S$ to be PSD and hence, by Lemma 1, $L$ to be a $P_0$ matrix.
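The kernel construction under this decomposition is a few lines of code; the sketch below (ours; the rank values are illustrative) builds $L$ and confirms that its symmetric part is exactly $VV^\top$ and PSD:

```python
import numpy as np

def build_kernel(V, B, C):
    """L = V V^T + (B C^T - C B^T): low-rank PSD part plus skew-symmetric part."""
    return V @ V.T + (B @ C.T - C @ B.T)

rng = np.random.default_rng(3)
M, K, D = 6, 2, 2            # catalog size and factor ranks (illustrative values)
V = rng.normal(size=(M, K))
B = rng.normal(size=(M, D))
C = rng.normal(size=(M, D))
L = build_kernel(V, B, C)

# Decompose L back into symmetric and skew-symmetric parts.
sym_part = (L + L.T) / 2
skew_part = (L - L.T) / 2
```

Because $BC^\top - CB^\top$ is skew-symmetric for any $B$ and $C$, the parametrization is unconstrained: any gradient step on $V$, $B$, $C$ stays inside the class of $P_0$ kernels guaranteed by Lemma 1.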
We define a regularization term, $R(V, B, C)$, as

$$R(V, B, C) = -\alpha \sum_{i=1}^{M} \frac{1}{\lambda_i} \|v_i\|_2^2 - \beta \sum_{i=1}^{M} \frac{1}{\lambda_i} \|b_i\|_2^2 - \gamma \sum_{i=1}^{M} \frac{1}{\lambda_i} \|c_i\|_2^2, \qquad (11)$$

where $\lambda_i$ counts the number of occurrences of item $i$ in the training set, $v_i$, $b_i$, and $c_i$ are the corresponding row vectors of $V$, $B$, and $C$, respectively, and $\alpha$, $\beta$, and $\gamma$ are tunable hyperparameters. This regularization formulation is similar to that proposed in Gartrell et al. (2017). From the above, we have the full formulation of the regularized log-likelihood of our model:

$$\phi(V, B, C) = \sum_{i=1}^{n} \log \det(L_{A_i}) - n \log \det(L + I) + R(V, B, C). \qquad (12)$$
The computational complexity of evaluating Eq. 12 is dominated by computing the determinant in the second term (the normalization constant), which is $O(M^3)$. Furthermore, since the gradient of $\log \det(L + I)$ involves $(L + I)^{-1}$, the computational complexity of computing the gradient of Eq. 12 during learning is also dominated by a matrix inverse, which is $O(M^3)$. Therefore, we see that the low-rank decomposition of the kernel in our nonsymmetric model does not afford any improvement over a full-rank model in terms of computational complexity. However, our low-rank decomposition does provide a savings in terms of the memory required to store model parameters, since our low-rank model has space complexity $O(M(K + D))$, where $K$ and $D$ are the ranks of the factors of $S$ and $A$, while a full-rank version of this nonsymmetric model has space complexity $O(M^2)$. When $K \ll M$ and $D \ll M$, which is typical in many settings, this results in a significant space savings.
5 Experiments
We run extensive experiments on several synthetic and real-world datasets. Since the focus of our work is on improving DPP modeling power and comparing nonsymmetric and symmetric DPPs, we use the standard symmetric low-rank DPP as the baseline model for our experiments.
5.1 Datasets
We perform next-item prediction and AUC-based classification experiments on two real-world datasets composed of purchased shopping baskets:

Amazon Baby Registries: This public dataset consists of 111,006 registries or "baskets" of baby products, and has been used in prior work on DPP learning Gartrell et al. (2016); Gillenwater et al. (2014); Mariet and Sra (2015). The registries are collected from 15 different categories, such as "apparel", "diapers", etc., and the items in each category are disjoint. We evaluate our models on the popular apparel category.
We also perform an evaluation on a dataset composed of the three most popular categories: apparel, diaper, and feeding. We construct this dataset, composed of three large disjoint categories of items, with a catalog of 100 items in each category, to highlight the differences in how nonsymmetric and symmetric DPPs model data. In particular, we will see that the nonsymmetric DPP uses positive correlations to capture item co-occurrences within baskets, while negative correlations are used to capture disjoint pairs of items. In contrast, since symmetric DPPs can only represent negative correlations, they must attempt to capture both co-occurring items and disjoint items using only negative correlations.

UK Retail: This is a public dataset Chen (2012) that contains 25,898 baskets drawn from a catalog of 4,070 items. This dataset contains transactions from a non-store online retail company that primarily sells unique all-occasion gifts, and many customers are wholesalers. We omit all baskets with more than 100 items, which allows us to use a low-rank factorization of the symmetric DPP that scales well in training and prediction time, while also keeping memory consumption for model parameters to a manageable level.

We also perform an evaluation on synthetically generated data. Our data generator allows us to explicitly control the item catalog size, the distribution of set sizes, and the item co-occurrence distribution. By controlling these parameters, we are able to empirically study how the nonsymmetric and symmetric models behave for data with a specified correlation structure.
5.2 Experimental setup and metrics
Next-item prediction involves identifying the best item to add to a subset of selected items (e.g., basket completion), and is the primary prediction task we evaluate.
We compute a next-item prediction for a basket $A$ by conditioning the DPP on the event that all items in $A$ are observed. As described in Gillenwater (2014), we compute this conditional kernel, $L^A$, as $L^A = L_{\bar{A}} - L_{\bar{A},A} L_A^{-1} L_{A,\bar{A}}$, where $\bar{A} = \mathcal{Y} \setminus A$, $L_A$ is the restriction of $L$ to the rows and columns indexed by $A$, and $L_{\bar{A},A}$ consists of the rows of $L$ indexed by $\bar{A}$ and the columns indexed by $A$. The computational complexity of this operation is dominated by the matrix multiplications, which cost $O(M^2 |A|)$.
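This conditioning step can be sketched in a few lines (our own code), together with a numerical check of the Schur-complement identity $\det(L + I_{\bar{A}}) = \det(L_A)\det(L^A + I)$ that underlies the conditional normalization:

```python
import numpy as np

def conditional_kernel(L, A):
    """L^A = L_Abar - L_{Abar,A} L_A^{-1} L_{A,Abar}: the kernel of the DPP
    conditioned on every item of A being in the observed set."""
    n = L.shape[0]
    A = list(A)
    Abar = [i for i in range(n) if i not in A]
    L_A_inv = np.linalg.inv(L[np.ix_(A, A)])
    return (L[np.ix_(Abar, Abar)]
            - L[np.ix_(Abar, A)] @ L_A_inv @ L[np.ix_(A, Abar)])

rng = np.random.default_rng(4)
G = rng.normal(size=(4, 4))
L = G @ G.T + 0.1 * np.eye(4)   # a well-conditioned PSD kernel for the sketch
A = [0, 1]
L_cond = conditional_kernel(L, A)

# Identity check: det(L + I_Abar) = det(L_A) * det(L^A + I),
# where I_Abar is the identity restricted to the complement of A.
I_Abar = np.diag([0.0, 0.0, 1.0, 1.0])
lhs = np.linalg.det(L + I_Abar)
rhs = np.linalg.det(L[np.ix_(A, A)]) * np.linalg.det(L_cond + np.eye(2))
```

The diagonal of the (normalized) conditional kernel then scores every candidate item not in `A`, which is how a next-item prediction can be read off.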
We compare the performance of all methods using a standard recommender system metric: mean percentile rank (MPR). An MPR of 50 is equivalent to random selection; an MPR of 100 indicates that the model perfectly predicts the held-out item. MPR is a recall-based metric that we use to evaluate the model's predictive power by measuring how well it predicts the next item in a basket; it is a standard choice for recommender systems Hu et al. (2008); Li et al. (2010). See Appendix C for a formal description of how the MPR metric is computed.
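A minimal sketch of the percentile-rank computation as we read the metric (the function names are ours; Appendix C remains the formal reference):

```python
import numpy as np

def percentile_rank(scores, held_out):
    """Percentage of candidate items scored at or below the held-out item:
    100 means the model ranked the held-out item first."""
    scores = np.asarray(scores, dtype=float)
    return 100.0 * np.mean(scores <= scores[held_out])

def mean_percentile_rank(score_lists, held_out_items):
    """MPR: the mean percentile rank over all test instances."""
    ranks = [percentile_rank(s, h) for s, h in zip(score_lists, held_out_items)]
    return float(np.mean(ranks))
```

Under random scoring the expected percentile rank is about 50, matching the stated baseline interpretation of the metric.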
We evaluate the discriminative power of each model using the AUC metric. For this task, we generate a set of negative subsets uniformly at random. For each positive subset in the test set, we generate a negative subset of the same length by drawing samples uniformly at random, and ensure that the same item is not drawn more than once for a subset. We compute the AUC for the model on these positive and negative subsets, where the score for each subset is the loglikelihood that the model assigns to the subset. This task measures the ability of the model to discriminate between observed positive subsets (groundtruth subsets) and randomly generated subsets.
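The negative-subset sampler described above can be sketched as follows (our own code; sampling uniformly without replacement is what prevents a repeated item within a subset):

```python
import numpy as np

def sample_negative_subset(positive_subset, catalog_size, rng):
    """Draw a random subset with the same length as the positive subset,
    sampling items uniformly without replacement so no item repeats."""
    drawn = rng.choice(catalog_size, size=len(positive_subset), replace=False)
    return set(drawn.tolist())

rng = np.random.default_rng(5)
positives = [{1, 5, 9}, {2, 3}]
negatives = [sample_negative_subset(p, 100, rng) for p in positives]
```

Each positive subset in the test set gets one such negative counterpart, and the model's log-likelihood on both sides yields the AUC scores.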
For all experiments, a random selection of 80% of the baskets is used for training, and the remaining 20% is used for testing. We use a small held-out validation set for tracking convergence and tuning hyperparameters. Convergence is reached during training when the relative change in validation log-likelihood is below a predetermined threshold, which is set identically for all models. We implement our models using PyTorch, and use the Adam Kingma and Ba (2015) optimization algorithm to train our models.
5.3 Results on synthetic datasets
We run a series of synthetic experiments to examine the differences between nonsymmetric and symmetric DPPs. In all of these experiments, we define an oracle that controls the generative process for the data. The oracle uses a deterministic policy to generate a dataset composed of positive baskets (items that co-occur) and negative baskets (items that do not co-occur). This generative policy defines the expected normalized determinantal volume for each pair of items, and a threshold that limits the maximum determinantal volume for a positive basket and the minimum volume for a negative basket. This threshold is used to compute AUC results for this set of positives and negatives. Note that the negative sets are used only during evaluation. For each experiment, in Figures 1, 2, and 3, we plot a transformed version of the learned kernel matrices for the nonsymmetric and symmetric models, where each element of this matrix is reweighted by the normalized determinantal volume of the corresponding pair. For each plotted transformation, a red element corresponds to a negative correlation, which will tend to result in the model predicting that the corresponding pair is a negative pair. Black and green elements correspond to smaller and larger positive correlations, respectively, for the nonsymmetric model, and very small negative correlations for the symmetric model; the model will tend to predict that the corresponding pair is positive in these cases. We perform the AUC-based evaluation for each pair by comparing the determinantal volume predicted by the model with the ground-truth determinantal volume provided by the oracle; this task is equivalent to performing basket completion for the pair. In Figures 1, 2, and 3, we show the prediction error for each pair, where green corresponds to low error, and red corresponds to high error.
Recovering positive examples for low-sparsity data
In this experiment we aim to show that the nonsymmetric model is just as capable as the symmetric model when it comes to learning negative correlations when trained on data containing few negative correlations and many positive correlations. We choose a setting where the symmetric model performs well. We construct a dataset that contains no large disjoint collections of items, with 100 baskets of size six, and a catalog of 100 items. To reduce the impact of negative correlations between items, we use a categorical distribution, with nonuniform event probabilities, for sampling the items that populate each basket, with a large coverage of possible item pairs. This ensures few negative correlations, since there is a low probability that two items will never co-occur. For the nonsymmetric DPP, the oracle expects the model to predict a low negative correlation, or a positive correlation, for a pair of products that have a high co-occurrence probability in the data. The results of this experiment are shown in Figure 1. We see from the plots showing the transformed matrices that both the nonsymmetric and symmetric models recover approximately the same structure, resulting in similar error plots, and a similar predictive AUC of approximately 0.8 for both models.
Recovering negative examples for high-sparsity data
We construct a more challenging scenario for this experiment, which reveals an important limitation of the symmetric DPP. The symmetric DPP requires a relatively high density of observed item pairs (positive pairs) in order to learn the negative structure of the data that describes items that do not co-occur. During learning, the DPP will maximize determinantal volumes for positive pairs, while the normalization constant maintains a representation of the global volume of the parameter space for the entire item catalog. For a high density of observed positive pairs, increasing the volume allocated to positive pairs will result in a decrease in the volume assigned to many negative pairs, in order to maintain approximately the same global volume represented by the normalization constant. For a low density of positive pairs, the model will not allocate low volumes to many negative pairs. This phenomenon affects both the nonsymmetric and symmetric models. Therefore, the difference in each model's ability to capture negative structure within a low-density region of positive pairs can be explained in terms of how each model maximizes determinantal volumes using positive and negative correlations. In the case of the symmetric DPP, the model can increase determinantal volumes only by using smaller negative correlations, resulting in off-diagonal values that approach zero. As these off-diagonal parameters approach zero, this behavior has the side effect of also increasing the determinantal volumes of subsets within disjoint groups, since these volumes are also affected by these small parameter values. In contrast, the nonsymmetric model behaves differently: determinantal volumes can be maximized by switching the signs of the off-diagonal entries of the kernel and increasing the magnitude of these parameters, rather than reducing them to near zero.
This behavior allows the model to assign higher volumes to positive pairs than to negative pairs within disjoint groups in many cases, thus allowing the nonsymmetric model to recover disjoint structure.
In our experiment, the oracle controls the sparsity of the data by setting the number of disjoint groups of items; positive pairs within each disjoint group are generated uniformly at random, in order to focus on the effect of disjoint groups. For the AUC evaluation, negative baskets are constructed so that they contain items from at least two different disjoint groups. When constructing our dataset, we set the number of disjoint groups to 14, with 100 baskets of size six, and a catalog of 100 items. The results of our experiment are shown in Figure 2. We see from the error plot that the symmetric model cannot effectively learn the structure of the data, leading to high error in many areas, including within the disjoint blocks; the symmetric model provides an AUC of 0.5 as a result. In contrast, the nonsymmetric model is able to approximately recover the block structure, resulting in an AUC of 0.7.
Recovering positive examples for data that mixes disjoint sparsity with popularity-based positive structure
For our final synthetic experiment, we construct a scenario that combines aspects of our two previous experiments. In this experiment, we consider three disjoint groups. For each disjoint group we use a categorical distribution with nonuniform event probabilities for sampling items within baskets, which induces a positive correlation structure within each group. Therefore, the oracle will expect to see a high negative correlation for disjoint pairs, compared to all other nondisjoint pairs within a particular disjoint group. For items with a high co-occurrence probability, we expect the symmetric DPP to recover a near-zero negative correlation, and the nonsymmetric DPP to recover a positive correlation. Furthermore, we expect both the nonsymmetric and symmetric models to recover higher marginal probabilities, or diagonal $K_{ii}$ values, for more popular items. The determinantal volumes for positive pairs containing popular items will thus tend to be larger than the volumes of negative pairs. Therefore, for baskets containing popular items, we expect that both the nonsymmetric and symmetric models will be able to easily discriminate between positive and negative baskets. When constructing positive baskets, popular items are sampled with high probability, proportional to their popularity. We therefore expect that both models will be able to recover some signal about the correlation structure of the data within each disjoint group, resulting in a predictive AUC higher than 0.5, since the popularity-based positive correlation structure within each group allows the model to recover some structure about correlations among item pairs within each group. However, we expect that the nonsymmetric model will provide better predictive performance than the symmetric model, since its properties enable recovery of disjoint structure (as discussed previously).
We see the expected results in Figure 3, which are further confirmed by the predictive AUC results, with the nonsymmetric model achieving a higher AUC than the symmetric model.






             Amazon Diaper              Amazon three-category       UK Retail
             Sym DPP       Nonsym DPP   Sym DPP       Nonsym DPP    Sym DPP       Nonsym DPP
MPR          77.42 ± 1.12  80.32 ± 0.75 60.61 ± 0.94  75.09 ± 0.85  76.79 ± 0.60  79.45 ± 0.57
AUC          0.66 ± 0.01   0.73 ± 0.01  0.70 ± 0.01   0.79 ± 0.01   0.57 ± 0.001  0.65 ± 0.01

Table 1: MPR and AUC results for the Amazon Diaper, Amazon three-category (Apparel + Diaper + Feeding), and UK Retail datasets. Results show mean and 95% confidence estimates obtained using bootstrapping. Bold values indicate improvement over the symmetric low-rank DPP outside of the confidence interval. We use for both Amazon datasets; for the Amazon three-category dataset; for the Amazon apparel dataset; for the UK dataset; and for all datasets.

5.4 Results on real-world datasets
To examine how the nonsymmetric model behaves when trained on a real-world dataset with clear disjoint structure, we first train and evaluate the model on the three-category Amazon baby registry dataset. This dataset is composed of three disjoint categories of items, where each disjoint category is composed of 100 items and approximately 10,000 baskets. Given the structure of this dataset, with a small item catalog for each category and a large number of baskets relative to the size of the catalog, we would expect a relatively high density of positive pairwise item correlations within each category. Furthermore, since each category is disjoint, we would expect the model to recover a low density of positive correlations between pairs of items that are disjoint, since these items do not co-occur within observed baskets. We see the experimental results for this three-category dataset in Figure 4. As expected, positive correlations dominate within each category, e.g., within category 1, the model encodes 80.5% of the pairwise interactions as positive correlations. For pairwise interactions between items in two disjoint categories, we see that negative correlations dominate, e.g., between two of the categories, the model encodes 97.2% of the pairwise interactions as negative correlations (or equivalently, 2.8% as positive interactions).
Table 1 shows the results of our performance evaluation on the Amazon and UK Retail datasets. Compared to the symmetric DPP, we see that the nonsymmetric DPP provides moderate to large improvements on both the MPR and AUC metrics for all datasets. In particular, we see a substantial improvement on the three-category Amazon dataset, providing further evidence that the nonsymmetric DPP is far more effective than the symmetric DPP at recovering the structure of data that contains large disjoint components.
6 Conclusion
By leveraging a low-rank decomposition of the nonsymmetric DPP kernel, we have introduced a tractable MLE-based algorithm for learning nonsymmetric DPPs from data. To the best of our knowledge, this is the first MLE-based learning algorithm for nonsymmetric DPPs. A general framework for the theoretical analysis of the maximum likelihood estimator for a somewhat restricted class of nonsymmetric DPPs establishes consistency guarantees for this estimator. While symmetric DPPs are limited to capturing only repulsive item interactions, nonsymmetric DPPs allow for both repulsive and attractive item interactions, which lead to fundamental changes in model behavior. Through an extensive experimental evaluation on several synthetic and real-world datasets, we have demonstrated that nonsymmetric DPPs can provide significant improvements in modeling power and predictive performance compared to symmetric DPPs. We believe that our contributions open the door to an array of future work on nonsymmetric DPPs, including an investigation of sampling algorithms, reductions in the computational complexity of learning, and further theoretical understanding of the properties of the model.
References
 Affandi et al. [2014] R. Affandi, E. Fox, R. Adams, and B. Taskar. Learning the parameters of determinantal point process kernels. In ICML, 2014.

Anari et al. [2016] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 103–115. PMLR, 2016.
Borcea et al. [2009] Julius Borcea, Petter Brändén, and Thomas Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22(2):521–567, 2009.
 Borodin [2009] Alexei Borodin. Determinantal Point Processes. arXiv:0911.1153, 2009.
Brunel [2018] Victor-Emmanuel Brunel. Learning signed determinantal point processes through the principal minor assignment problem. In NeurIPS, pages 7365–7374, 2018.
Brunel et al. [2017] Victor-Emmanuel Brunel, Ankur Moitra, Philippe Rigollet, and John Urschel. Rates of estimation for determinantal point processes. In Conference on Learning Theory, pages 343–345, 2017.

Chao et al. [2015] Wei-Lun Chao, Boqing Gong, Kristen Grauman, and Fei Sha. Large-margin determinantal point processes. In Uncertainty in Artificial Intelligence (UAI), 2015.
Chen [2012] D. Chen. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing and Customer Strategy Management, 19(3), August 2012.
 Decreusefond et al. [2015] Laurent Decreusefond, Ian Flint, Nicolas Privault, and Giovanni Luca Torrisi. Determinantal Point Processes, 2015.
 Dupuy and Bach [2016] Christophe Dupuy and Francis Bach. Learning Determinantal Point Processes in sublinear time, 2016.
 Gartrell et al. [2016] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Bayesian low-rank determinantal point processes. In RecSys. ACM, 2016.
 Gartrell et al. [2017] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Low-rank factorization of Determinantal Point Processes. In AAAI, 2017.
 Gillenwater [2014] J. Gillenwater. Approximate Inference for Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2014.
 Gillenwater et al. [2014] J. Gillenwater, A. Kulesza, E. Fox, and B. Taskar. Expectation-maximization for learning Determinantal Point Processes. In NIPS, 2014.
 Hu et al. [2008] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.
 Kingma and Ba [2015] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 Krause et al. [2008] Andreas Krause, Ajit Singh, and Carlos Guestrin. Nearoptimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.
 Kulesza [2013] A. Kulesza. Learning with Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2013.
 Kulesza and Taskar [2011] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, 2011.
 Kulesza and Taskar [2012] A. Kulesza and B. Taskar. Determinantal Point Processes for machine learning, volume 5. Foundations and Trends in Machine Learning, 2012.
 Lavancier et al. [2015] Frédéric Lavancier, Jesper Møller, and Ege Rubak. Determinantal Point Process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):853–877, 2015.
 Li et al. [2016] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast DPP sampling for Nyström with application to kernel methods. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2061–2070, New York, New York, USA, 20–22 Jun 2016. PMLR.
 Li et al. [2010] Yanen Li, Jia Hu, ChengXiang Zhai, and Ye Chen. Improving one-class collaborative filtering by incorporating rich user information. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10, 2010.
 Lin and Bilmes [2012] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), 2012.
 Mariet and Sra [2015] Zelda Mariet and Suvrit Sra. Fixed-point algorithms for learning Determinantal Point Processes. In ICML, 2015.
 Mariet and Sra [2016] Zelda Mariet and Suvrit Sra. Kronecker Determinantal Point Processes. In NIPS, 2016.
 Rebeschini and Karbasi [2015] Patrick Rebeschini and Amin Karbasi. Fast mixing for discrete point processes. In Proceedings of The 28th Conference on Learning Theory, volume 40 of Proceedings of Machine Learning Research, pages 1480–1500, Paris, France, 03–06 Jul 2015. PMLR.
 Tsatsomeros [2004] Michael J. Tsatsomeros. Generating and detecting matrices with positive principal minors. In Focus on Computational Neurobiology, pages 115–132. Nova Science Publishers, Inc., 2004.
 Zhang et al. [2017] Cheng Zhang, Hedvig Kjellström, and Stephan Mandt. Stochastic learning on imbalanced data: Determinantal Point Processes for minibatch diversification. CoRR, abs/1705.00607, 2017.
Appendix A Counterexamples to the backward implication in Part 3 of Lemma 3
Here, we provide counterexamples showing that, in nonsymmetric scenarios, (10) can be satisfied by nonzero matrices even when the kernel is not block diagonal. This implies that irreducible (i.e., not block diagonal up to relabeling of the items) kernels may be significantly harder to learn in the nonsymmetric case.
Case of signed matrices with unknown signs
In this case, define
and
Then this choice satisfies (10), and yet the kernel is irreducible. Note that this kernel is symmetric. However, since it is not known beforehand that the kernel is symmetric, the space of perturbations is much larger.
Case of matrices that are the sum of a positive diagonal matrix and a skew-symmetric matrix
The case of matrices that are the sum of a positive diagonal matrix and a skew-symmetric matrix is particularly interesting in applications where only attractive interactions between items (i.e., nonnegative correlations) are sought. In this case, it would be interesting to characterize the null space of the Fisher information, as in [6, Theorem 3].
Open Question.
Let the model class be the set of all matrices that are the sum of a positive diagonal matrix and a skew-symmetric matrix. Recall that, by Lemma 2, the associated set of perturbations consists of all matrices that are the sum of an arbitrary diagonal matrix and an arbitrary skew-symmetric matrix.
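To make the attractive behavior of this kernel class concrete, consider a two-item kernel of this form (the numerical values below are illustrative). A short check with the marginal kernel K = I − (L + I)⁻¹ shows that the skew-symmetric part makes joint inclusion of the two items more likely than under independence:

```python
import numpy as np

# A minimal 2-item kernel L = D + A with D positive diagonal and
# A skew-symmetric; the entries are illustrative.
D = np.diag([1.0, 2.0])
A = np.array([[0.0, 0.5],
              [-0.5, 0.0]])
L = D + A

# Marginal kernel of the L-ensemble: K = I - (L + I)^{-1}.
K = np.eye(2) - np.linalg.inv(L + np.eye(2))

p1 = K[0, 0]            # P(item 1 in Y)
p2 = K[1, 1]            # P(item 2 in Y)
p12 = np.linalg.det(K)  # P(both items in Y) = K11*K22 - K12*K21

# K inherits the skew structure, so K12*K21 = -K12**2 <= 0 and
# p12 >= p1*p2: a nonnegative (attractive) correlation.
print(p12 - p1 * p2)
```

The printed gap is exactly K12², quantifying how strongly the skew-symmetric part couples the two items.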
Appendix B Proofs
Proof of Lemma 1
By the generating method 4.2 in [28], if a matrix is positive definite, then it is a $P$-matrix, hence a $P_0$-matrix. Assume now that the matrix $L$ is only PSD. Let $\epsilon > 0$ and $L_\epsilon = L + \epsilon I$. Then $L_\epsilon$ is positive definite, hence a $P$-matrix by the argument given above, so it is a $P_0$-matrix. Let $f_S(M) = \det(M_S)$, for $S \subseteq [N]$. Then each $f_S$ is a continuous function, and the set of $P_0$ matrices is the set of all matrices $M$ with $f_S(M) \geq 0$ for all $S$; hence, it is closed. Therefore, $L$ is a $P_0$ matrix, as the limit of the $P_0$ matrices $L_\epsilon$ as $\epsilon \to 0$.
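The statement can be sanity-checked numerically: a matrix whose symmetric part is PSD should have all principal minors nonnegative. A brute-force sketch (the variable names are ours):

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
N = 5
G = rng.normal(size=(N, N))
sym = G @ G.T                  # symmetric PSD part
R = rng.normal(size=(N, N))
skew = R - R.T                 # skew-symmetric part
M = sym + skew                 # nonsymmetric; x' M x = x' sym x >= 0

# P0 property: every principal minor det(M_S) is nonnegative.
minors = [np.linalg.det(M[np.ix_(idx, idx)])
          for r in range(1, N + 1)
          for idx in itertools.combinations(range(N), r)]
print(min(minors))
```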
Proof of Lemma 2
We only prove the third statement, since the first and fifth are very simple, and the second and fourth are directly implied by the third. For each pair of indices, we let the corresponding elementary matrix be the matrix with zeros everywhere except at that entry, where we put a one.
When the two indices coincide, the two matrices in question are both in the model class; hence, their difference is in the perturbation set. When the indices differ, note that the two relevant matrices are both in the model class, since they are diagonally dominant; hence, their difference is again in the perturbation set. Therefore, the perturbation set contains every elementary matrix, and the claim follows, since all matrices are linear combinations of the elementary matrices.
Proof of Theorem 1
The arguments are almost identical to those given in [6, Theorem 2]. The main idea is to note that for all matrices $L$,
$$\sum_{J \subseteq [N]} \det(L_J) = \det(L + I), \qquad (13)$$
which is a consequence of the multilinearity of the determinant. Then, we differentiate (13) twice on the set of $P$-matrices, noticing that this is an open set. Indeed, the set of $P$-matrices is the set of all matrices whose principal minors are all positive, and each principal minor is a continuous function of the matrix entries. Hence, following the same computations as in the proof of [6, Theorem 2] yields the desired result.
Proof of Lemma 3

Take a singleton set in (10). The corresponding submatrix is then a $1 \times 1$ matrix whose single entry is the matching diagonal entry of the perturbation; since the associated diagonal entry of the kernel is nonzero, this entry of the perturbation must be zero.

Now, take a pair of distinct indices and write the corresponding $2 \times 2$ submatrix, whose off-diagonal entries are nonzero by assumption. Using the first part of the lemma, the diagonal entries of the perturbation must vanish, so the perturbation submatrix is determined by its off-diagonal entries. A direct computation then shows that if (10) is satisfied, these entries must be zero.
Appendix C Mean Percentile Rank
We begin our definition of MPR by defining percentile rank (PR). First, given a set $A$ and an item $v$, let $p_v = \Pr(A \cup \{v\})$, the probability the model assigns to the set $A$ augmented with $v$. The percentile rank of an item $v$ given a set $A$ is defined as
$$\mathrm{PR}_{v,A} = \frac{\sum_{v' \in \mathcal{Y} \setminus A} \mathbb{1}(p_v \geq p_{v'})}{|\mathcal{Y} \setminus A|} \times 100,$$
where $\mathcal{Y} \setminus A$ indicates those elements in the ground set $\mathcal{Y}$ that are not found in $A$.
MPR is then computed as
$$\mathrm{MPR} = \frac{1}{|\mathcal{T}|} \sum_{A \in \mathcal{T}} \mathrm{PR}_{v,A},$$
where $\mathcal{T}$ is the set of test instances and $v$ is a randomly selected element in each set $A$.
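For reference, a minimal sketch of this metric. The helper `score_fn(A, v)`, standing in for the model probability of the set $A$ augmented with $v$, and all variable names are our own illustrative choices:

```python
import numpy as np

def percentile_rank(v, A, scores, ground):
    """PR of a held-out item v given the observed set A: the percentage
    of candidate items (ground set minus A) whose score does not exceed
    that of v."""
    candidates = [u for u in ground if u not in A]
    hits = sum(scores[v] >= scores[u] for u in candidates)
    return 100.0 * hits / len(candidates)

def mean_percentile_rank(test_instances, score_fn, ground):
    """Average PR over test instances, holding out one random element
    per instance; score_fn(A, v) plays the role of Pr(A + {v})."""
    rng = np.random.default_rng(0)
    prs = []
    for inst in test_instances:
        v = inst[rng.integers(len(inst))]
        A = [u for u in inst if u != v]
        scores = {u: score_fn(A, u) for u in ground if u not in A}
        prs.append(percentile_rank(v, A, scores, ground))
    return float(np.mean(prs))

# With a constant score every candidate ties, so each PR is 100.
print(mean_percentile_rank([[0, 1], [1, 2, 3]], lambda A, v: 1.0, range(4)))
```

An MPR of 100 means the held-out item is always ranked first among candidates; 50 is the expected value of a uniformly random ranker.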