1 Introduction
Determinantal point processes (DPPs) have proven useful in numerous machine learning applications. For example, recent uses include summarization Sharghi et al. (2018), recommender systems Wilhelm et al. (2018), neural network compression Mariet and Sra (2016), kernel approximation Li et al. (2016), multi-modal output generation Elfeki et al. (2019), and batch selection, both for stochastic optimization Zhang et al. (2017) and for active learning Bıyık et al. (2019). DPPs have been applied in these cases because for each there existed a subset-selection problem, and a good solution to this problem was to select items that were high-quality but also diverse; DPPs provide a means of trading off quality with diversity in a principled way Kulesza et al. (2012).
For subset selection problems where the ground set of items to select from has cardinality $M$, the typical DPP is parameterized by an $M \times M$ kernel matrix. Most prior work has been concerned with symmetric DPPs, where the kernel must equal its transpose. However, recent work has considered the more general class of nonsymmetric DPPs (NDPPs) and shown that these have additional useful modeling power Brunel (2018); Gartrell et al. (2019). In particular, NDPPs allow modeling of positive correlations between items, where the presence of item $i$ in the selected set increases the probability that some other item $j$ will also be selected. Symmetric DPPs cannot capture such positive correlations. There are many intuitive examples of how positive correlations can be of practical importance. For example, consider a product recommendation task for a website, where a user has a camera in her shopping cart, and the goal is to display several other items that she might purchase. Relative to an empty cart, the presence of the camera probably increases the probability that the person will buy an accessory like a tripod. Although NDPPs can model such behavior, scalable approaches for NDPP learning and MAP inference have not been studied.
Here, we build on prior NDPP work and make the following contributions:
Learning: We propose a decomposition for NDPP kernels that reduces the complexity of learning from cubic Gartrell et al. (2019) to linear in $M$, the size of the DPP ground set.
MAP inference: For the problem of finding the highest-probability subset under a DPP (the “MAP inference” problem), we analyze a standard greedy subset selection algorithm and show that, for low-rank NDPPs, it can be run in time linear in the size of the DPP ground set. We also describe how the same algorithm can be used to do conditioning for next-item prediction tasks. We then prove a lower bound on greedy MAP performance for all NDPPs.
We combine the above contributions through experiments that involve learning NDPP kernels and applying MAP inference to these kernels for several real-world datasets. These experiments demonstrate that our scalable decomposition matches or exceeds the predictive performance of prior work. We also show that, for small synthetic kernels generated from learned kernels, the proposed greedy MAP algorithm has an approximation quality that often substantially exceeds the lower bound, with near-optimal performance in many cases.
2 Background
Consider a finite set $\mathcal{Y} = \{1, 2, \ldots, M\}$ of cardinality $M$, which we will also denote by $[M]$. A DPP on $[M]$ defines a probability distribution over all its $2^M$ subsets. It is parameterized by a matrix $L \in \mathbb{R}^{M \times M}$, called the kernel, such that the probability of each subset $Y \subseteq [M]$ is proportional to the determinant of its corresponding principal submatrix: $\Pr(Y) \propto \det(L_Y)$. The normalization constant for this distribution can be expressed as a single determinant: $\sum_{Y \subseteq [M]} \det(L_Y) = \det(L + I)$ (Kulesza et al., 2012, Theorem 2.1). Hence, $\Pr(Y) = \det(L_Y) / \det(L + I)$. We will use $\mathcal{P}_L$ to denote this distribution.
For intuition about the kernel parameters, notice that the probabilities of singletons $\{i\}$ and $\{j\}$ are proportional to $L_{ii}$ and $L_{jj}$, respectively. Hence, it is common to think of $L$'s diagonal as representing item qualities. The probability of the set $\{i, j\}$ is proportional to $\det(L_{\{i,j\}}) = L_{ii} L_{jj} - L_{ij} L_{ji}$. Thus, if $-L_{ij} L_{ji} > 0$, then $i$ and $j$ are positively correlated. Similarly, if $-L_{ij} L_{ji} < 0$, then $i$ and $j$ are negatively correlated. Therefore, off-diagonal terms determine item correlations.
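The normalization identity can be checked by brute force on a tiny ground set. The following is a minimal sketch, not code from the paper: the kernel is a random nonsymmetric PSD matrix built only for this illustration, and the helper names `subset_det` and `dpp_probabilities` are hypothetical.

```python
import itertools
import numpy as np

def subset_det(L, Y):
    """det(L_Y); the determinant of the empty submatrix is 1 by convention."""
    Y = list(Y)
    if not Y:
        return 1.0
    return float(np.linalg.det(L[np.ix_(Y, Y)]))

def dpp_probabilities(L):
    """Enumerate every subset of the ground set and return {subset: probability}."""
    M = L.shape[0]
    Z = np.linalg.det(L + np.eye(M))  # normalizer det(L + I)
    probs = {}
    for r in range(M + 1):
        for Y in itertools.combinations(range(M), r):
            probs[Y] = subset_det(L, Y) / Z
    return probs

# Illustrative kernel (an assumption): symmetric PSD part plus a skew-symmetric part.
rng = np.random.default_rng(0)
M = 4
V = rng.normal(size=(M, 2))
G = rng.normal(size=(M, M))
L = V @ V.T + (G - G.T)

probs = dpp_probabilities(L)
total = sum(probs.values())  # should be 1 up to rounding error
```

Enumeration costs $2^M$ determinants, so this is only useful as a sanity check for very small $M$.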
In order to ensure that $\mathcal{P}_L$ defines a probability distribution, all principal minors of $L$ must be non-negative: $\det(L_Y) \ge 0$ for all $Y \subseteq [M]$. Matrices that satisfy this property are called $P_0$-matrices (Fang, 1989, Definition 1). There is no known generation method or matrix decomposition that fully covers the space of all $P_0$ matrices, although there are many that partially cover the space Tsatsomeros (2004).
One common partial solution is to use a decomposition that covers the space of symmetric $P_0$ matrices. By restricting to the space of symmetric matrices, one can exploit the fact that $L \in P_0$ if and only if $L$ is positive semidefinite (PSD) Prussing (1986). (Recall that a matrix $L$ is defined to be PSD if and only if $x^\top L x \ge 0$, for all $x \in \mathbb{R}^M$.) Any symmetric PSD matrix can be written as the Gramian matrix of some set of vectors: $L := V V^\top$, where $V \in \mathbb{R}^{M \times K}$. Hence, the $V V^\top$ decomposition provides an easy means of generating the entire space of symmetric $P_0$ matrices. It also has a nice intuitive interpretation, if we view the $i$-th row of $V$ as a length-$K$ feature vector describing item $i$.
Unfortunately, the symmetry requirement severely limits the types of correlations that a DPP can capture. A symmetric model is able to capture only nonpositive correlations between items, since $-L_{ij} L_{ji} = -L_{ij}^2 \le 0$, whereas a nonsymmetric $L$ can capture both negative and positive correlations (see (Gartrell et al., 2019, Section 2.1) for more intuition). To expand coverage to nonsymmetric matrices in $P_0$, it is natural to consider nonsymmetric PSD matrices. In what follows, we denote by $P_0^+$ the set of all nonsymmetric (and symmetric) PSD matrices. Any nonsymmetric PSD matrix is in $P_0$ (Gartrell et al., 2019, Lemma 1), so $P_0^+ \subseteq P_0$. However, unlike in the symmetric case, the set of nonsymmetric PSD matrices does not fully cover the set of nonsymmetric $P_0$ matrices. For example, there exist $2 \times 2$ matrices whose principal minors are all nonnegative but that are not PSD.
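A small numerical check makes the gap between PSD matrices and matrices with nonnegative principal minors concrete. The matrix below is chosen here purely for illustration and may differ from the example in the original text; the helper names `is_p0` and `is_psd` are hypothetical.

```python
import itertools
import numpy as np

def is_p0(L, tol=1e-12):
    """True if every principal minor of L is nonnegative (brute force, small M only)."""
    M = L.shape[0]
    for r in range(1, M + 1):
        for Y in itertools.combinations(range(M), r):
            idx = list(Y)
            if np.linalg.det(L[np.ix_(idx, idx)]) < -tol:
                return False
    return True

def is_psd(L, tol=1e-12):
    """x^T L x >= 0 for all x iff the symmetric part (L + L^T)/2 has no negative eigenvalue."""
    sym = (L + L.T) / 2.0
    return bool(np.linalg.eigvalsh(sym).min() >= -tol)

# Principal minors are 1, 1, and 1 - (5/3)(1/2) = 1/6, all nonnegative.
L = np.array([[1.0, 5.0 / 3.0],
              [0.5, 1.0]])

# The quadratic form is negative at x = (-1, 1): 1 - 5/3 - 1/2 + 1 = -1/6.
x = np.array([-1.0, 1.0])
quad = float(x @ L @ x)
```

Here `is_p0(L)` holds while `is_psd(L)` does not, so this kernel defines a valid DPP that lies outside the nonsymmetric PSD class.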
Still, nonsymmetric PSD matrices cover a large enough portion of the $P_0$ space to be useful in practice, as evidenced by the experiments of Gartrell et al. (2019). This work covered the $P_0^+$ space by using the following decomposition: $L := S + A$, with $S := V V^\top$ for $V \in \mathbb{R}^{M \times K}$, and $A := B C^\top - C B^\top$ for $B, C \in \mathbb{R}^{M \times K}$. This decomposition makes use of the fact that any matrix $L$ can be decomposed uniquely as the sum of a symmetric matrix $S$ and a skew-symmetric matrix $A$. (To see this, let $S = (L + L^\top)/2$ and $A = (L - L^\top)/2$.) All skew-symmetric matrices $A$ are trivially PSD, since $x^\top A x = 0$ for all $x \in \mathbb{R}^M$. Hence, the $L$ here is guaranteed to be PSD simply because its $S$ uses the standard Gramian decomposition $V V^\top$.
In this work we will also only consider $P_0^+$, and leave to future work the problem of finding tractable ways to cover the rest of $P_0$. We propose a new decomposition of $A$ that allows for more scalable learning. As in prior DPP work, our decomposition has inner dimension $K$ that could be as large as $M$, but is usually much smaller in practice. Our algorithms work well for modest values of $K$. In cases where the natural $K$ is larger (e.g., natural language processing applications where $K$ might be the number of words), random projections can often be used to significantly reduce $K$ Gillenwater et al. (2012a).
3 New kernel decomposition and scalable learning
Prior work on NDPPs proposed a maximum likelihood estimation (MLE) learning algorithm Gartrell et al. (2019). Due to that work's particular kernel decomposition, this learning approach has complexity cubic in the number of items $M$. Here, we propose an alternative decomposition that reduces the complexity to linear in $M$.
We begin by showing that our new decomposition covers the space of $P_0^+$ matrices. Before diving in, let us define $\mathcal{D}(\lambda) := \begin{pmatrix} 0 & \lambda \\ -\lambda & 0 \end{pmatrix}$ as shorthand for a $2 \times 2$ block matrix with zeros on-diagonal and opposite values off-diagonal. Then, our proposed decomposition is as follows:
$$L := S + A, \quad \text{with } S := V V^\top \text{ and } A := B C B^\top, \tag{1}$$
where $V, B \in \mathbb{R}^{M \times K}$, and $C \in \mathbb{R}^{K \times K}$ is a block-diagonal matrix with some diagonal blocks of the form $\mathcal{D}(\lambda_i)$, with $\lambda_i > 0$, and zeros elsewhere. The following lemma shows that this decomposition covers the space of $P_0^+$ matrices.
Lemma 1.
For , let be an even integer and let be a skew-symmetric matrix with rank . Then, there exist and positive numbers , such that , where is the block-diagonal matrix with diagonal blocks of size 2 given by , .
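To make the structure of the decomposition concrete, the sketch below builds a kernel of the form $L = V V^\top + B C B^\top$ with a block-diagonal $C$ and checks the two properties used in the argument: the second term is skew-symmetric, and it contributes nothing to the quadratic form $x^\top L x$. The sizes, the block values, and the helper name `block_skew` are assumptions made for this illustration.

```python
import numpy as np

def block_skew(lams, K):
    """K x K block-diagonal matrix with 2x2 blocks [[0, lam], [-lam, 0]], zeros elsewhere."""
    C = np.zeros((K, K))
    for j, lam in enumerate(lams):
        C[2 * j, 2 * j + 1] = lam
        C[2 * j + 1, 2 * j] = -lam
    return C

rng = np.random.default_rng(1)
M, K = 6, 4
V = rng.normal(size=(M, K))
B = rng.normal(size=(M, K))
C = block_skew([0.7, 1.3], K)   # two blocks with positive off-diagonal values

S = V @ V.T        # symmetric PSD part
A = B @ C @ B.T    # skew-symmetric part, since C^T = -C
L = S + A

x = rng.normal(size=M)
quad_A = float(x @ A @ x)   # zero up to rounding for any skew-symmetric A
quad_L = float(x @ L @ x)   # therefore equal to x^T S x, which is nonnegative
```

Because $x^\top A x = 0$ for every $x$, the kernel is PSD regardless of the values chosen for the blocks of $C$.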
The proof of Lemma 1 and all subsequent results can be found in the appendix. With this decomposition in hand, we now proceed to show that it can be used for linear-time MLE learning. To do so, we must show that the corresponding NDPP log-likelihood objective and gradient can be computed in time linear in $M$. Given a collection of $n$ observed subsets $\{Y_1, \ldots, Y_n\}$ composed of items from $[M]$, the full formulation of the regularized log-likelihood is:
$$\phi(V, B, C) = \frac{1}{n} \sum_{i=1}^{n} \log\det\left(V_{Y_i} V_{Y_i}^\top + B_{Y_i} C B_{Y_i}^\top\right) - \log\det\left(V V^\top + B C B^\top + I\right) - R(V, B), \tag{2}$$
where $V_{Y_i} \in \mathbb{R}^{|Y_i| \times K}$ denotes a matrix composed of the rows of $V$ that correspond to the items in $Y_i$. The regularization term, $R(V, B)$, is defined as follows:
$$R(V, B) = \alpha \sum_{i=1}^{M} \frac{1}{\mu_i} \|v_i\|_2^2 + \beta \sum_{i=1}^{M} \frac{1}{\mu_i} \|b_i\|_2^2, \tag{3}$$
where $\mu_i$ counts the number of occurrences of item $i$ in the training set, $v_i$ and $b_i$ are rows of $V$ and $B$, respectively, and $\alpha, \beta > 0$ are tunable hyperparameters. This regularization is similar to that of prior work Gartrell et al. (2017, 2019). We omit regularization for $C$ since regularization based on item counts (popularity) cannot be directly applied to $C$. Furthermore, we observe from our experiments that regularization on $C$ does not seem to be needed in practice.
Theorems 1 and 2 show that computing the regularized log-likelihood and its gradient both have time complexity linear in $M$. The complexities also depend on $K$, the rank of the NDPP, and $K'$, the size of the largest observed subset in the data. In practice though, usually $K \ll M$ and $K' \ll M$. Hence, linearity in $M$ means that we can efficiently perform learning for datasets with very large ground sets, which is impossible with the cubic-complexity decomposition in prior work Gartrell et al. (2019). We briefly note here that a key component in the proof of Theorem 1 is use of the Woodbury matrix identity. This allows us to transform an operation whose cost is cubic in $M$ into one whose cost is linear in $M$, by replacing a computation on a size-$M$ matrix with matrix multiplications and a computation on a size-$K$ matrix. A similar approach was also recently adopted in work on normalizing flows for deep generative models Lu and Huang (2020).
Theorem 1.
Given a nonsymmetric low-rank DPP parameterized by of rank , of rank , and a matrix , where , we can compute the regularized log-likelihood, eq:nonsymm-full-log-likelihood, in time, where is the size of the largest of the training subsets.
Theorem 2.
Given a nonsymmetric low-rank DPP parameterized by of rank , of rank , and a matrix , where , we can compute the gradient of the regularized log-likelihood in eq:nonsymm-full-log-likelihood in time, where is the size of the largest of the training subsets.
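The source of the linear dependence on $M$ can be illustrated with the normalization term alone. The sketch below is an illustration under stated assumptions rather than the implementation used in the experiments: it uses NumPy instead of PyTorch, the function names are hypothetical, and it relies on Sylvester's determinant identity $\det(I_M + Z X Z^\top) = \det(I_{2K} + X Z^\top Z)$, a close relative of the Woodbury identity, with $Z = (V \ B)$ and $X = \mathrm{diag}(I, C)$.

```python
import numpy as np

def logdet_normalizer_lowrank(V, B, C):
    """log det(I + V V^T + B C B^T) using only a 2K x 2K determinant."""
    K = V.shape[1]
    Z = np.hstack([V, B])                         # M x 2K
    X = np.block([[np.eye(K), np.zeros((K, K))],
                  [np.zeros((K, K)), C]])         # 2K x 2K
    G = Z.T @ Z                                   # O(M K^2); the only step that touches M
    _, logabs = np.linalg.slogdet(np.eye(2 * K) + X @ G)
    return float(logabs)

def logdet_normalizer_dense(V, B, C):
    """Reference computation that forms the full M x M kernel (cubic in M)."""
    M = V.shape[0]
    L = V @ V.T + B @ C @ B.T
    _, logabs = np.linalg.slogdet(np.eye(M) + L)
    return float(logabs)

rng = np.random.default_rng(2)
M, K = 50, 3
V = rng.normal(size=(M, K))
B = rng.normal(size=(M, K))
D = rng.normal(size=(K, K))
C = D - D.T                                       # any skew-symmetric C works here

fast = logdet_normalizer_lowrank(V, B, C)
slow = logdet_normalizer_dense(V, B, C)           # agrees with `fast` up to rounding
```

The low-rank route never materializes an $M \times M$ matrix, which is also what keeps memory use linear in $M$ in this sketch.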
To further simplify learning and MAP inference, we set $B = V$, which results in $L = V V^\top + V C V^\top = V (I + C) V^\top$. This change also simplifies regularization, so that we only perform regularization on $V$, as indicated in the first term of Eq. (3), leaving us with the single regularization hyperparameter of $\alpha$. While setting $B = V$ restricts the class of nonsymmetric kernels that can be represented, we find in practice that this does not adversely impact prediction quality.
To compensate for the restriction imposed by setting $B = V$, we also relax the block-diagonal structure imposed on $C$, so that we learn a full skew-symmetric $K \times K$ matrix $C$. We empirically observe that this relaxation of the block-diagonal structure is needed to ensure learning of nonsymmetric kernels with better predictive performance. To ensure that $C$ and thus $A$ is skew-symmetric, we set $C = D - D^\top$, where $D \in \mathbb{R}^{K \times K}$.
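The simplified parameterization can be sketched as follows, assuming the kernel takes the form $L = V (I + C) V^\top$ with $C = D - D^\top$ for an unconstrained $K \times K$ matrix $D$. This is a NumPy illustration with hypothetical function names; it omits the regularizer and is not the PyTorch training code referenced in the experiments.

```python
import numpy as np

def simplified_kernel(V, D):
    """Dense kernel L = V (I + C) V^T with C = D - D^T (skew-symmetric by construction)."""
    K = V.shape[1]
    C = D - D.T
    return V @ (np.eye(K) + C) @ V.T, C

def log_likelihood(V, D, baskets):
    """Average log-probability of the observed baskets, without forming the M x M kernel."""
    K = V.shape[1]
    inner = np.eye(K) + (D - D.T)
    # Normalizer through a K x K determinant: det(I_M + V X V^T) = det(I_K + X V^T V).
    _, log_norm = np.linalg.slogdet(np.eye(K) + inner @ (V.T @ V))
    total = 0.0
    for Y in baskets:
        VY = V[list(Y)]                               # rows of V for the items in Y
        _, logdet = np.linalg.slogdet(VY @ inner @ VY.T)
        total += logdet
    return float(total / len(baskets) - log_norm)

rng = np.random.default_rng(3)
M, K = 8, 3
V = rng.normal(size=(M, K))
D = rng.normal(size=(K, K))
baskets = [(0, 1), (2, 5, 7), (3,)]                   # each basket has at most K items
ll = log_likelihood(V, D, baskets)
```

Note that a kernel of rank $K$ assigns zero probability to any basket with more than $K$ items, which is why the example baskets are kept small.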
4 MAP inference
After learning an NDPP, one can then use it to infer the most probable item subsets in various situations. Several inference algorithms have been well-studied for symmetric DPPs, including sampling Kulesza and Taskar (2011); Anari et al. (2016); Li et al. (2016); Launay et al. (2018); Gillenwater et al. (2019); Poulson (2019); Dereziński (2019) and MAP inference Gillenwater et al. (2012b); Han et al. (2017); Chen et al. (2018); Han and Gillenwater (2020). We focus on MAP inference:
$$\operatorname*{argmax}_{Y \subseteq [M]} \det(L_Y) \quad \text{such that} \quad |Y| = k, \tag{4}$$
for cardinality budget $k \le M$. MAP inference for DPPs is known to be NP-hard even in the symmetric case Ko et al. (1995); Kulesza et al. (2012). For symmetric DPPs, one usually approximates the MAP via the standard greedy algorithm for submodular maximization Nemhauser et al. (1978). First, we describe how to efficiently implement this for NDPPs. Then, in Section 4.1 we prove a lower bound on its approximation quality. To the best of our knowledge, this is the first investigation of how to apply the greedy algorithm to NDPPs.
Greedy begins with an empty set and repeatedly adds the item that maximizes the marginal gain until the chosen set is size $k$. Here, we aim to design an efficient greedy algorithm when the NDPP kernel is given by a low-rank matrix. For generality, in what follows we write the kernel as $L = B C B^\top$, since one can easily rewrite our matrix decomposition, as well as that of Gartrell et al. (2019), to take this form. For example, for our decomposition: $L = V V^\top + B C B^\top = \begin{pmatrix} V & B \end{pmatrix} \begin{pmatrix} I & 0 \\ 0 & C \end{pmatrix} \begin{pmatrix} V^\top \\ B^\top \end{pmatrix}$. Using Schur's determinant identity, we first observe that, for $Y \subseteq [M]$ and $i \in [M] \setminus Y$, the marginal gain of an NDPP can be written as
$$\frac{\det(L_{Y \cup \{i\}})}{\det(L_Y)} = L_{ii} - L_{iY} L_Y^{-1} L_{Yi} = b_i C b_i^\top - b_i C B_Y^\top \left(B_Y C B_Y^\top\right)^{-1} B_Y C b_i^\top, \tag{5}$$
where and . A naïve computation of eq:bilinear-marginalgain is , since we must invert a matrix, where . However, one can compute eq:bilinear-marginalgain more efficiently by observing that it actually can be expressed as a rank- matrix and hence computed in time.
Lemma 2.
Given , , and , let be the -th row in and be a matrix containing rows in indexed by . Then, it holds that
(6) |
where row vectors for satisfy , , and
(7) |
Plugging eq:bilinear-inverse into eq:bilinear-marginalgain, the marginal gain with respect to can be computed by simply updating from the previous gain with respect to . That is,
(8) | ||||
(9) |
The marginal gains when are equal to diagonals of and require operations. Then, computing the update terms in eq:marginalgainupdate for all needs operations. Since the total number of updates is , the overall complexity becomes . We provide a full description of the implied greedy algorithm for low-rank NDPPs in alg:greedylowrank.
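For reference, the greedy procedure can be written directly in terms of the Schur-complement form of the marginal gain. The sketch below is a naive baseline that re-solves a linear system for every candidate; it does not implement the incremental low-rank update described above, and the function names are hypothetical.

```python
import numpy as np

def marginal_gain(L, Y, i):
    """det(L_{Y union {i}}) / det(L_Y) via the Schur complement L_ii - L_iY L_Y^{-1} L_Yi."""
    if not Y:
        return float(L[i, i])
    LY = L[np.ix_(Y, Y)]
    return float(L[i, i] - L[i, Y] @ np.linalg.solve(LY, L[Y, i]))

def greedy_map(L, k):
    """Naive greedy: start from the empty set and add the best item k times."""
    M = L.shape[0]
    Y = []
    for _ in range(k):
        candidates = [i for i in range(M) if i not in Y]
        gains = [marginal_gain(L, Y, i) for i in candidates]
        Y.append(candidates[int(np.argmax(gains))])
    return Y

# Illustrative low-rank nonsymmetric kernel (an assumption); k must not exceed its rank.
rng = np.random.default_rng(4)
M, K = 8, 4
V = rng.normal(size=(M, K))
D = rng.normal(size=(K, K))
L = V @ (np.eye(K) + D - D.T) @ V.T

selected = greedy_map(L, k=3)
```

Each call to `marginal_gain` costs a solve with a $|Y| \times |Y|$ matrix, which is exactly the per-candidate work that the low-rank update avoids.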
Low-rank DPP Models | MLE Learning | MAP Inference |
Symmetric DPP Gartrell et al. (2017) | ||
Nonsymmetric DPP Gartrell et al. (2019) | ||
Scalable nonsymmetric DPP (this work) |
Table 1 summarizes the runtime complexity of our methods and those of previous work. We also note that the memory required for MAP inference is of the same order in all cases. For learning, the memory required is larger for the NDPPs from prior work Gartrell et al. (2019) than for low-rank symmetric DPPs Gartrell et al. (2017) and our proposed scalable NDPPs.
4.1 Approximation guarantee for greedy NDPP MAP inference
As mentioned above, Algorithm 1 is an instantiation of the standard greedy algorithm used for submodular maximization Nemhauser et al. (1978). This algorithm has a $(1 - 1/e)$-approximation guarantee for the problem of maximizing nonnegative, monotone submodular functions. While the function $f(Y) = \log\det(L_Y)$ is submodular for a symmetric PSD $L$ Kelmans and Kimelfeld (1983), it is not monotone. Often, as in Han and Gillenwater (2020), it is assumed that the smallest eigenvalue of $L$ is greater than $1$, which guarantees monotonicity. There is no particular evidence that this assumption is true for typical practical models, but nevertheless the greedy algorithm tends to perform well in practice for symmetric DPPs. Here, we prove a similar approximation guarantee that covers NDPPs as well, even though the function $f(Y) = \log\det(L_Y)$ is non-submodular when $L$ is nonsymmetric. In Section 5.4, we further observe that, as for symmetric DPPs, the greedy algorithm seems to work well in practice for NDPPs.
Recently, Bian et al. (2017) proposed an extension of greedy algorithm guarantees to non-submodular functions. Their result is based on the submodularity ratio and curvature of the objective function, which measure to what extent the objective has submodular properties. Leveraging this result, in Theorem 3 we provide an approximation ratio for greedy MAP inference of NDPPs.
Theorem 3.
Consider a nonsymmetric low-rank DPP , where are of rank , and . Given a cardinality budget , let and denote the smallest and largest singular values of
(10) |
where is the output of alg:greedylowrank and is the optimal solution of MAP inference in eq:mapinference.
Thus, when the kernel has a small value of , the greedy algorithm finds a near-optimal solution. In practice, we observe that the greedy algorithm finds a near-optimal solution even for large values of this ratio (see sec:mapinference). As remarked above, there is no evidence that the condition is usually true in practice. While this condition can be achieved by multiplying by a constant, this leads to a (potentially large) additive term in eq:greedyratio. We provide cor:greedy in sec:corr-greedy, which excludes the assumption, and quantifies this additive term.
4.2 Greedy conditioning for next-item prediction
We briefly describe here a small modification to the greedy algorithm that is necessary if one wants to use it as a tool for next-item prediction. Given a set $A \subseteq [M]$, Kulesza et al. (2012) showed that a DPP with kernel $L$ conditioned on the inclusion of the items in the set $A$ forms another DPP with kernel $L^A := L_{\bar{A}} - L_{\bar{A}, A} L_A^{-1} L_{A, \bar{A}}$, where $\bar{A} = [M] \setminus A$. The singleton probability $\Pr(\{i\} \mid A) \propto L^A_{ii}$ can be useful for doing next-item prediction. We can use the same machinery from the greedy algorithm's marginal gain computations to effectively compute these singletons. More concretely, suppose that we are doing next-item prediction as a shopper adds items to a digital cart. We predict the item that maximizes the marginal gain, conditioned on the current cart contents (the set $Y$ in the greedy algorithm). When the shopper adds the next item to their cart, we update $Y$ to include this item, rather than our predicted item (line 10 in Algorithm 1). We then iterate until the shopper checks out. The comments on the righthand side of Algorithm 1 summarize this procedure. The runtime of this prediction is the same as that of the greedy algorithm. We note that this cost is comparable to that of an approach based on the DPP dual kernel from prior work Mariet et al. (2019). However, since it is non-trivial to define the dual kernel for NDPPs, the greedy algorithm may be the simpler choice for next-item prediction for NDPPs.
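The conditioning step can be sketched as a scoring function over the items outside the cart. The code below is a direct, non-incremental illustration that assumes the conditional kernel is the Schur complement of $L_A$ in $L$; the function names and the example kernel are hypothetical.

```python
import numpy as np

def next_item_scores(L, cart):
    """Unnormalized conditional scores (diagonal of the conditional kernel) for items outside the cart."""
    M = L.shape[0]
    cart = list(cart)
    rest = [i for i in range(M) if i not in cart]
    if not cart:
        return rest, np.diag(L)[rest]
    LA = L[np.ix_(cart, cart)]
    L_rest_A = L[np.ix_(rest, cart)]
    L_A_rest = L[np.ix_(cart, rest)]
    # Diagonal of L_rest,A L_A^{-1} L_A,rest, computed without forming the full product.
    correction = np.einsum("ij,ji->i", L_rest_A, np.linalg.solve(LA, L_A_rest))
    return rest, np.diag(L)[rest] - correction

def predict_next(L, cart):
    """Item with the largest conditional score given the current cart contents."""
    rest, scores = next_item_scores(L, cart)
    return rest[int(np.argmax(scores))]

rng = np.random.default_rng(6)
M, K = 8, 4
V = rng.normal(size=(M, K))
D = rng.normal(size=(K, K))
L = V @ (np.eye(K) + D - D.T) @ V.T

cart = [1, 4]                      # items the shopper has actually added so far
prediction = predict_next(L, cart)
```

In a session, `cart` would be extended with the item the shopper actually adds, not with `prediction`, before the next call.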
5 Experiments
Code for all experiments is available at
https://github.com/cgartrel/nonsymmetric-DPP-learning/tree/scalable.
5.1 Datasets
We perform experiments on several real-world public datasets composed of online shopping baskets:
- Amazon Baby Registries: This dataset consists of registries or "baskets" of baby products, and has been used in prior work on DPP learning Gartrell et al. (2016, 2019); Gillenwater et al. (2014); Mariet and Sra (2015). The registries contain items from 15 different categories, such as “apparel”, with a catalog of up to 100 items per category. Our evaluation mirrors that of Gartrell et al. (2019); we evaluate on the popular apparel category, which contains 14,970 registries, as well as on a dataset composed of the three most popular categories: apparel, diaper, and feeding, which contains a total of 31,218 registries.
- UK Retail: This dataset Chen et al. (2012) contains baskets representing transactions from an online retail company that sells unique all-occasion gifts. We omit baskets with more than 100 items, leaving us with a dataset containing 19,762 baskets drawn from a catalog of 3,941 products. Baskets containing more than 100 items are in the long tail of the basket-size distribution of the data, so omitting larger baskets is reasonable, and allows us to use a low-rank factorization of the DPP with $K = 100$.
- Instacart: This dataset Instacart (2017) contains baskets purchased by Instacart users. Again, we omit baskets with more than 100 items, leaving us with 3.2 million baskets and a catalog of 49,677 products.
5.2 Experimental setup and metrics
We use a small held-out validation set, consisting of 100 randomly-selected baskets, for tracking convergence during training and for tuning hyperparameters. A random selection of 2000 of the remaining baskets is used for testing, and the rest are used for training. Convergence is reached during training when the relative change in validation log-likelihood is below a predetermined threshold. We use PyTorch with Adam Kingma and Ba (2015) for optimization.
Subset expansion task. We use greedy conditioning to do next-item prediction, as described in Section 4.2. We compare methods using a standard recommender system metric: mean percentile rank (MPR) Hu et al. (2008); Li et al. (2010). MPR of 50 is equivalent to random selection; MPR of 100 means that the model perfectly predicts the next item. See the appendix for a complete description of the MPR metric.
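As a rough illustration of the metric, the sketch below computes a percentile rank as the percentage of candidate items whose score does not exceed that of the held-out item, then averages over test instances. This is one common formulation and is an assumption here; the exact definition used in the experiments (for example, its tie handling) may differ. The function names are hypothetical.

```python
import numpy as np

def percentile_rank(scores, held_out):
    """Percentage of candidates scored no higher than the held-out item."""
    scores = np.asarray(scores, dtype=float)
    return 100.0 * float(np.mean(scores <= scores[held_out]))

def mean_percentile_rank(instances):
    """Average percentile rank over (scores, held_out_index) pairs."""
    return float(np.mean([percentile_rank(s, h) for s, h in instances]))

# Two toy test instances: candidate scores and the index of the true next item.
instances = [
    ([0.1, 0.2, 0.9], 2),        # true item is ranked first
    ([0.9, 0.2, 0.1, 0.5], 3),   # true item outscores two of the three others
]
mpr = mean_percentile_rank(instances)
```

Under this formulation a model that always scores the true next item highest attains an MPR of 100.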
Subset discrimination task. We also test the ability of a model to discriminate observed subsets from randomly generated ones. For each subset in the test set, we generate a subset of the same length by drawing items uniformly at random (and we ensure that the same item is not drawn more than once for a subset). We compute the AUC for the model on these observed and random subsets, where the score for each subset is the log-likelihood that the model assigns the subset.
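The discrimination protocol can be sketched in two pieces: drawing a random subset of matching size without repeated items, and computing AUC from the two sets of scores. The code below is a minimal illustration with hypothetical function names; the scores would be the log-likelihoods assigned by the model, which are replaced here by toy numbers.

```python
import numpy as np

def sample_negative(basket_size, M, rng):
    """Random subset of the given size, items drawn uniformly without repetition."""
    return sorted(rng.choice(M, size=basket_size, replace=False).tolist())

def auc(pos_scores, neg_scores):
    """Probability that an observed subset outscores a random one (ties count one half)."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    return float(np.mean((pos > neg) + 0.5 * (pos == neg)))

rng = np.random.default_rng(0)
negative = sample_negative(5, 20, rng)            # same size as a 5-item test basket

# Toy log-likelihoods (assumed values, for illustration only).
observed_scores = [-9.1, -10.4, -8.7]
random_scores = [-12.3, -9.8, -11.0]
score = auc(observed_scores, random_scores)
```

An AUC of 0.5 corresponds to a model that cannot tell observed subsets from random ones.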
Metric | Amazon: Apparel, Sym | Amazon: Apparel, Nonsym | Amazon: Apparel, Scalable nonsym | Amazon: 3-category, Sym | Amazon: 3-category, Nonsym | Amazon: 3-category, Scalable nonsym
---|---|---|---|---|---|---
MPR | 77.42 ± 1.12 | 80.32 ± 0.75 | 84.86 ± 1.51 | 60.61 ± 0.94 | 75.09 ± 0.85 | 75.08 ± 1.64
AUC | 0.66 ± 0.01 | 0.73 ± 0.01 | 0.71 ± 0.03 | 0.70 ± 0.01 | 0.79 ± 0.01 | 0.80 ± 0.04
test log-likelihood | -10.31 | -9.66 | -9.64 | -18.11 | -17.02 | -17.22
Metric | UK Retail, Sym | UK Retail, Nonsym | UK Retail, Scalable nonsym | Instacart, Sym | Instacart, Nonsym | Instacart, Scalable nonsym
---|---|---|---|---|---|---
MPR | 76.79 ± 0.60 | 79.45 ± 0.57 | 81.64 ± 1.22 | 91.01 ± 1.22 | - | 92.90 ± 0.49
AUC | 0.57 ± 0.001 | 0.65 ± 0.01 | 0.76 ± 0.02 | 0.83 ± 0.01 | - | 0.84 ± 0.01
test log-likelihood | -120.47 | -108.67 | -106 | -70.81 | - | -69.12
Table 2: Predictive performance of the low-rank symmetric DPP Gartrell et al. (2017), the low-rank NDPP Gartrell et al. (2019), and our scalable NDPP models. MPR and AUC results show 95% confidence estimates obtained via bootstrapping. Bold values indicate improvement over the symmetric low-rank DPP outside of the confidence interval. See the appendix for the hyperparameter settings used in these experiments. The baseline NDPP model cannot be feasibly trained on Instacart, as memory and computational costs are prohibitive due to its large ground set size.
5.3 Predictive performance results for learning
Since the focus of our work is on improving NDPP scalability, we use the low-rank symmetric DPP Gartrell et al. (2017) and the low-rank NDPP of prior work Gartrell et al. (2019) as baselines for our experiments. Table 2 compares these DPP approaches and our scalable low-rank NDPP. We see that our scalable NDPP matches or exceeds the predictive quality of the baseline NDPP. Notice that our scalable NDPP opens the door to training on datasets with large $M$, such as the Instacart dataset, which is infeasible for the baseline NDPP due to very high memory and compute costs. As expected, we also empirically observe that the scalable NDPP trains far faster than the NDPP for datasets with large ground sets. For example, the per-iteration gradient update of the scalable NDPP is faster than that of the decomposition of Gartrell et al. (2019) on the UK dataset. See the appendix for a comparison of overall training times.
5.4 Performance results for MAP inference
We run various approximatation algorithms for MAP inference, including the greedy algorithm (alg:greedylowrank), stochastic greedy algorithm Mirzasoleiman et al. (2015), MCMC-based DPP sampling Li et al. (2016), and greedy local search Kathuria and Deshpande (2016). The stochastic greedy algorithm computes marginal gains of a few items chosen uniformly at random and selects the best among them. The MCMC sampling begins with a random subset of size and picks and uniformly at random. Then, it swaps them with probability and iterates this process. The greedy local search algorithm Kathuria and Deshpande (2016) starts from the output from the greedy algorithm, , and replaces with that gives the maximum improvement, if such exist. This replacement process iterates until no improvement exists, or at most steps have been completed, to guarantee a tight approximation Kathuria and Deshpande (2016). We use greedy local search as a baseline since it always returns a better solution than greedy. However, it is the slowest among all algorithms, as its time complexity is . We choose , and provide more details of all algorithms in sec:benchmark.
To evaluate the performance of MAP inference, we report the relative log-determinant ratio, defined as where is the output of benchmark algorithms and is the greedy local search result. Results are reported in tab:mapinference. We observe that the greedy algorithm achieves performance close to that of the significantly more expensive greedy local search algorithm, with relative errors of up to . Stochastic greedy and MCMC sampling have significantly larger errors.
Algorithms | | | |
---|---|---|---|---
Greedy (Algorithm 1) | 0.0196 ± 0.0020 | 0.0613 ± 0.0027 | 0.0498 ± 0.0017 | 0.0085 ± 0.0005
Stochastic greedy Mirzasoleiman et al. (2015) | 0.1296 ± 0.0041 | 0.1716 ± 0.0041 | 0.1526 ± 0.0028 | 0.1408 ± 0.0044
MCMC sampling Li et al. (2016) | 0.5437 ± 0.0083 | 0.7640 ± 0.0092 | 0.8930 ± 0.0085 | 3.2646 ± 0.0401
5.5 Performance guarantee for greedy MAP inference
The matrices learned on real datasets are too large to compute the exact MAP solution, but we can compute exact MAP for small matrices. In this section, we explore the performance of the greedy algorithm studied in thm:greedy for synthetic kernel matrices. More formally, we first pick singular values from a kernel learned for the “Amazon: 3-category” dataset (a plot of these singular values can be seen in fig:approxratio:singular) and generate , where are random orthonormal matrices. To ensure that is a matrix, we repeatedly sample until all principal minors of are nonnegative. We also evaluate the performance of the symmetric DPP, where the kernel matrices are generated similarly to the NDPP, except we set . We set and generate random kernels for both symmetric DPPs and NDPPs.
The results for symmetric and nonsymmetric DPPs are shown in fig:approxratio:sym and fig:approxratio:nonsym, respectively. We plot the approximation ratio of alg:greedylowrank, i.e., , with respect to , from cor:greedy. We observe that the greedy algorithm for both often shows approximation ratios close to . However, the worst-case ratio for NDPPs is worse than that of symmetric DPPs; for is non-submodular, and the greedy algorithm with a nonsubmodular function does not have as tight of a worst-case bound as in the symmetric case.
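[Figure 1: singular values of the kernel learned on the “Amazon: 3-category” dataset; approximation ratios of the greedy algorithm for symmetric DPPs; approximation ratios of the greedy algorithm for NDPPs.]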
6 Conclusion
We have presented a new decomposition for nonsymmetric DPP kernels that can be learned in time linear in the size of the ground set, which is a significant improvement over the cubic complexity of prior work. Empirical results indicate that this decomposition matches or exceeds the predictive performance of the prior decomposition. We have also proved a lower bound on the quality of the greedy MAP approximation algorithm for nonsymmetric DPPs, and showed how to implement it efficiently. For future work, we will investigate tightening the performance bound for the greedy MAP algorithm, work on developing intuition about the meaning of the parameters in the $C$ matrix, and consider kernel decompositions that cover other parts of the nonsymmetric $P_0$ space.
Broader Impact
In general, we feel that our work moves in a positive direction by decreasing the storage and computation costs of learning NDPPs. However, in terms of broader impact, concerns related to this work are similar to those of other recommender system work (Milano et al., 2020). When applying our methods to learn kernels from user data, we recommend employing a technique such as differentially-private SGD (Abadi et al., 2016) to help prevent user data leaks, and adjusting the weights on training examples to balance the impact of sub-groups of users so as to make the final kernel as fair as possible.
References
- [1] (2016) Deep Learning with Differential Privacy. In Conference on Computer and Communications Security, Cited by: Broader Impact.
- [2] (2016) Monte Carlo Markov Chain Algorithms for Sampling Strongly Rayleigh Distributions and Determinantal Point Processes. In Conference on Learning Theory (COLT), Cited by: §4.
- [3] (2017) Guarantees for Greedy Maximization of Non-submodular Functions with Applications. In International Conference on Machine Learning (ICML), Cited by: §F.5, §4.1, Theorem 4.
- [4] (2019) Batch Active Learning Using Determinantal Point Processes. arXiv:1906.07975. Cited by: §1.
- [5] (2018) Learning Signed Determinantal Point Processes through the Principal Minor Assignment Problem. In Neural Information Processing Systems (NeurIPS), Cited by: §1.
- [6] (2012) Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing & Customer Strategy Management. Cited by: item 2.
- [7] (2018) Fast greedy MAP inference for Determinantal Point Process to improve recommendation diversity. In Neural Information Processing Systems (NeurIPS), Cited by: §4.
- [8] (2019) Fast determinantal point processes via distortion-free intermediate sampling. In Conference on Learning Theory (COLT), Cited by: §4.
- [9] (2019) GDPP: Learning Diverse Generations using Determinantal Point Processes. In International Conference on Machine Learning (ICML), Cited by: §1.
- [10] (1989) On the Spectra of - and -Matrices. In Linear Algebra and its Applications, Cited by: §2.
- [11] (2019) Learning Nonsymmetric Determinantal Point Processes. In Neural Information Processing Systems (NeurIPS), Cited by: Appendix B, Figure 2, Appendix C, §1, §1, §2, §3, §3, §3, Table 1, §4, §4, item 1, §5.3, Table 2.
- [12] (2016) Bayesian low-rank determinantal point processes. In Conference on Recommender Systems (RecSys), Cited by: Appendix B, item 1.
- [13] (2017) Low-Rank Factorization of Determinantal Point Processes. In Conference on Artificial Intelligence (AAAI), Cited by: Appendix B, §3, Table 1, §4, §5.3, Table 2.
- [14] (2014) Expectation-Maximization for learning Determinantal Point Processes. In Neural Information Processing Systems (NIPS), Cited by: item 1.
- [15] (2019) A Tree-Based Method for Fast Repeated Sampling of Determinantal Point Processes. In International Conference on Machine Learning (ICML), Cited by: §4.
- [16] (2012) Discovering Diverse and Salient Threads in Document Collections. In Empirical Methods in Natural Language Processing (EMNLP), Cited by: §2.
- [17] (2012) Near-Optimal MAP Inference for Determinantal Point Processes. In Neural Information Processing Systems (NIPS), Cited by: §4.
- [18] (2020) MAP Inference for Customized Determinantal Point Processes via Maximum Inner Product Search. In Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §4.1, §4.
- [19] (2017) Faster greedy MAP inference for determinantal point processes. In International Conference on Machine Learning (ICML), Cited by: §4.
- [20] (2008) Collaborative Filtering for Implicit Feedback Datasets. In International Conference on Data Mining (ICDM), Cited by: §5.2.
- [21] (2017) The Instacart Online Grocery Shopping Dataset. Note: Accessed May 2020 External Links: Link Cited by: item 3.
- [22] (2016) On sampling and greedy map inference of constrained determinantal point processes. arXiv preprint arXiv:1607.01551. Cited by: Appendix D, Table 4, §5.4.
- [23] (1983) Multiplicative submodularity of a matrix’s principal minor as a function of the set of its rows and some combinatorial applications. Discrete Mathematics. Cited by: §4.1.
- [24] (2015) Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), Cited by: §5.2.
- [25] (1995) An Exact Algorithm for Maximum Entropy Sampling. Operations Research. Cited by: §4.
- [26] (2012) Determinantal Point Processes for Machine Learning. Foundations and Trends® in Machine Learning. Cited by: §1, §2, §4.2, §4.
- [27] (2011) Learning determinantal point processes. In Conference on Uncertainty in Artificial Intelligence (UAI), Cited by: Appendix D, §4.
- [28] (2018) Exact Sampling of Determinantal Point Processes without Eigendecomposition. arXiv preprint arXiv:1802.08429. Cited by: §4.
- [29] (2016) Fast DPP Sampling for Nystrom with Application to Kernel Methods. In International Conference on Machine Learning (ICML), Cited by: Appendix D, Table 4, §1, §4, §5.4, Table 3.
- [30] (2010) Improving One-class Collaborative Filtering by Incorporating Rich User Information. In Conference on Information and Knowledge Management (CIKM), Cited by: §5.2.
- [31] (2020) Woodbury Transformations for Deep Generative Flows. arXiv preprint arXiv:2002.12229. Cited by: §F.2, §3.
- [32] (2019) Learning Determinantal Point Processes by Sampling Inferred Negatives. In Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §4.2.
- [33] (2015) Fixed-point algorithms for learning Determinantal Point Processes. In International Conference on Machine Learning (ICML), Cited by: item 1.
- [34] (2016) Diversity Networks: Neural Network Compression Using Determinantal Point Processes. In International Conference on Learning Representations (ICLR), Cited by: §1.
- [35] (2020) Recommender Systems and Their Ethical Challenges. AI & SOCIETY. Cited by: Broader Impact.
- [36] (2015) Lazier Than Lazy Greedy. In Conference on Artificial Intelligence (AAAI), Cited by: Appendix D, Table 4, §5.4, Table 3.
- [37] (1978) An Analysis of Approximations for Maximizing Submodular Set Functions I. Mathematical Programming 14 (1). Cited by: §4.1, §4.
- [38] (2019) High-performance sampling of generic Determinantal Point Processes. arXiv preprint arXiv:1905.00165. Cited by: §4.
- [39] (1986) The Principal Minor Test for Semidefinite Matrices. Journal of Guidance, Control, and Dynamics. Cited by: §2.
- [40] (2018) Improving Sequential Determinantal Point Processes for Supervised Video Summarization. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1.
- [41] (1988) Normal forms for skew-symmetric matrices and Hamiltonian systems with first integrals linear in momenta. In Proceedings of the American Mathematical Society, Cited by: §F.1.
- [42] (1972) Principal submatrices IX: Interlacing inequalities for singular values of submatrices. Linear Algebra and its Applications. Cited by: Theorem 5.
- [43] (2004) Generating and Detecting Matrices with Positive Principal Minors. In Focus on Computational Neurobiology, Cited by: §2.
- [44] (2018) Practical Diversified Recommendations on YouTube with Determinantal Point Processes. In Conference on Information and Knowledge Management (CIKM), Cited by: §1.
- [45] (2017) Determinantal Point Processes for Mini-Batch Diversification. In Conference on Uncertainty in Artificial Intelligence (UAI), Cited by: §1.
Appendix A Mean Percentile Rank
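We begin our definition of MPR by defining percentile rank (PR). First, given a set $J$, let $p_{i,J} = \Pr(J \cup \{i\} \mid J)$. The percentile rank of an item $i$ given a set $J$ is defined as
$$\mathrm{PR}_{i,J} = \frac{\sum_{i' \notin J} \mathbb{1}\left(p_{i,J} \geq p_{i',J}\right)}{|\mathcal{Y} \setminus J|} \times 100\%,$$
where $\mathcal{Y} \setminus J$ indicates those elements in the ground set $\mathcal{Y}$ that are not found in $J$.
For our evaluation, given a test set $Y$, we select a random element $i \in Y$ and compute $\mathrm{PR}_{i, Y \setminus \{i\}}$. We then average over the set of all test instances $\mathcal{T}$ to compute the mean percentile rank (MPR):
$$\mathrm{MPR} = \frac{1}{|\mathcal{T}|} \sum_{Y \in \mathcal{T}} \mathrm{PR}_{i, Y \setminus \{i\}}.$$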
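The evaluation procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: the function names, the `score` callback (standing in for the model's conditional probability of adding an item to a context set), and the fixed random seed are all assumptions made for the example. Ties are counted in the item's favor, which is one common convention for percentile rank.

```python
import random


def percentile_rank(item, context, ground_set, score):
    """Percentile rank of `item` among all ground-set items not in `context`.

    `score(i, context)` is a hypothetical callback standing in for the
    model's probability of adding item i to the context set.
    """
    candidates = [j for j in ground_set if j not in context]
    s = score(item, context)
    # Count candidates scored no higher than the held-out item (ties included).
    n_not_better = sum(1 for j in candidates if s >= score(j, context))
    return 100.0 * n_not_better / len(candidates)


def mean_percentile_rank(test_sets, ground_set, score, seed=0):
    """Average PR over test sets, holding out one random element per set."""
    rng = random.Random(seed)
    ranks = []
    for Y in test_sets:
        held_out = rng.choice(sorted(Y))
        context = set(Y) - {held_out}
        ranks.append(percentile_rank(held_out, context, ground_set, score))
    return sum(ranks) / len(ranks)
```

An MPR of 50 corresponds to ranking the held-out item no better than chance, and 100 means it always receives the top score among the candidates.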
Appendix B Hyperparameters for experiments in tab:predictive-qual
Preventing numerical instabilities: The first term on the right side of eq:nonsymm-full-log-likelihood will be singular whenever , where is an observed subset. Therefore, to address this in practice we set to the size of the largest subset observed in the data, , as in [13]. However, this does not entirely address the issue, as the first term on the right side of eq:nonsymm-full-log-likelihood may still be singular even when . In this case though, we know that we are not at a maximum, since the value of the objective function is . Numerically, to prevent such singularities, in our implementation we add a small correction to each when optimizing eq:nonsymm-full-log-likelihood (we set in our experiments).
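The small diagonal correction described above can be illustrated with a short sketch. The function name and the default value of `eps` are illustrative assumptions, not the authors' implementation or setting; the sketch only shows why adding a small multiple of the identity keeps the log-determinant of an observed-subset kernel finite.

```python
import numpy as np


def stabilized_logdet(L_Y, eps=1e-5):
    """log det(L_Y + eps * I) for a square submatrix L_Y.

    `eps` is an illustrative value. Returns -inf when the determinant is
    not strictly positive, which is the case the correction tries to avoid.
    """
    shifted = L_Y + eps * np.eye(L_Y.shape[0])
    sign, logabsdet = np.linalg.slogdet(shifted)
    return float(logabsdet) if sign > 0 else float("-inf")
```

For instance, the rank-one matrix `[[1, 1], [1, 1]]` gives `-inf` with `eps=0.0` and a finite value with the default correction.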
We perform a grid search using a held-out validation set to select the best performing hyperparameters for each model and dataset. The hyperparameter settings used for each model and dataset are described below.
Symmetric low-rank DPP [12]. For this model, we use for the number of item feature dimensions for the symmetric component , and for the regularization hyperparameter for . We use the following hyperparameter settings:
- Both Amazon datasets: .
- UK Retail dataset: .
- Instacart dataset: .
Baseline NDPP [11]. For this model, to ensure consistency with the notation used in [11], we use to denote the number of item feature dimensions for the symmetric component , and to denote the number of item feature dimensions for the nonsymmetric components, and . As described in [11], is the regularization hyperparameter for the , while and are the regularization hyperparameters for and , respectively. We use the following hyperparameter settings:
- Both Amazon datasets: .
- Amazon apparel dataset: .
- Amazon three-category dataset: .
- UK Retail dataset: .
- All datasets: .
Scalable NDPP. As described in sec:model, we use to denote the number of item feature dimensions for the symmetric component and the dimensionality of the nonsymmetric component . is the regularization hyperparameter. We use the following hyperparameter settings:
- Amazon apparel dataset: .
- Amazon three-category dataset: .
- UK Retail dataset: .
- Instacart dataset: .
For all of the above model configurations we use a batch size of 200 during training, except for the scalable NDPPs trained on the Amazon apparel and Instacart datasets, where a batch size of 400 is used.
Appendix C Training time
In fig:training-time, we report the wall-clock training time of the decomposition of [11] (NDPP) and our scalable NDPP for the Amazon: 3-category (fig:training-time:amazonthree) and UK Retail (fig:training-time:ukretail) datasets. For the Amazon: 3-category dataset, both approaches show comparable results, with the scalable NDPP converging times faster than NDPP. But for the UK Retail dataset, which has a much larger ground set, our scalable NDPP achieves convergence about times faster. We do not have a timing comparison for the Instacart dataset because the model with the decomposition of [11] cannot be trained on this dataset due to prohibitive memory and computational costs.
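[Figure (fig:training-time): wall-clock training time of NDPP and scalable NDPP on the Amazon: 3-category and UK Retail datasets.]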
Appendix D Benchmark algorithms for MAP inference
We test the following approximate algorithms for MAP inference:
Greedy local search.
This algorithm starts from the output of greedy, , and replaces with that gives the maximum improvement of the determinant, if such a swap exists. Kathuria and Deshpande [22] showed that running the search for such a swap times with an accuracy parameter gives a tight approximation guarantee for MAP inference for symmetric DPPs. We set the number of swaps to for and use greedy local search as a baseline, since its output is never worse than the greedy solution. However, greedy local search requires operations, and is thus the slowest of our baseline algorithms.
Stochastic greedy.
This algorithm computes marginal gains of a few items chosen uniformly at random and selects the best among them. [36] proved that samples are enough to guarantee an approximation ratio for submodular functions (i.e., symmetric DPPs). We choose and set the number of samples to . Under this setting, the time complexity of stochastic greedy is , which is better than the naïve exact greedy algorithm. However, we note that it is worse than that of our efficient greedy implementation (alg:greedylowrank). This is because stochastic greedy uses different random samples in every iteration, and so cannot take advantage of the amortized computations in lmm:bilinear-inverse. In our experiments, we simply modify line 10 in alg:greedylowrank for stochastic greedy ( is operated on a random subset of marginal gains), hence it can run in time. In practice, we observe that stochastic greedy is slightly slower than exact greedy due to the additional cost of the random sampling process.
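The idea of stochastic greedy can be sketched as follows. This is a naive illustration that recomputes a determinant for every candidate, not the low-rank implementation of alg:greedylowrank. The function name, the dense kernel argument `L`, and the sample-size rule `ceil((M / k) * log(1 / eps))` (the rule commonly associated with stochastic greedy) are assumptions made for the example.

```python
import math

import numpy as np


def stochastic_greedy_map(L, k, eps=0.1, seed=0):
    """Pick k items, each time choosing the best of a random candidate subset.

    L is a dense M x M kernel (assumed to have nonnegative principal minors).
    The sample size below is an assumed, commonly used rule.
    """
    rng = np.random.default_rng(seed)
    M = L.shape[0]
    n_samples = min(M, max(1, math.ceil(M / k * math.log(1.0 / eps))))
    selected = []
    for _ in range(k):
        remaining = [i for i in range(M) if i not in selected]
        size = min(n_samples, len(remaining))
        candidates = rng.choice(remaining, size=size, replace=False)
        best_item, best_det = None, -np.inf
        for j in candidates:
            idx = selected + [int(j)]
            # For a fixed current set, maximizing det(L_{Y + j}) is the same
            # as maximizing the marginal gain det(L_{Y + j}) / det(L_Y).
            d = np.linalg.det(L[np.ix_(idx, idx)])
            if d > best_det:
                best_item, best_det = int(j), d
        selected.append(best_item)
    return selected
```

When the sample size reaches the number of remaining items, every candidate is scored and the procedure reduces to exact greedy selection.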
MCMC sampling.
We also compare inference algorithms with sampling from a DPP with fixed size (known as a -DPP). Exact sampling [27] requires eigendecomposition of , which is infeasible for a large . To resolve this, Markov Chain Monte-Carlo (MCMC) based sampling is preferred. In particular, we consider the Metropolis-Hastings algorithm, which begins with a random subset with size , and picks and uniformly at random. Then, it swaps them with probability
(11)
and repeats this process for several steps. Recent work [29] shows that MCMC sampling provides promising results for kernel approximation. We set the number of swaps to (the same as for greedy local search), for a runtime complexity of , which is better than the greedy algorithm.
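A swap-based Metropolis-Hastings chain of this kind can be sketched as below. The acceptance probability used here, `min(1, det(L_proposal) / det(L_current))`, is the standard Metropolis-Hastings rule for a symmetric proposal and a target proportional to the subset determinant; it is an assumption for this sketch and may differ from the exact expression in (11). The function name and the dense kernel argument are also illustrative, and each step naively recomputes a determinant.

```python
import numpy as np


def kdpp_swap_chain(L, k, n_steps, seed=0):
    """Run a swap chain over size-k subsets and return the final subset."""
    rng = np.random.default_rng(seed)
    M = L.shape[0]
    Y = [int(i) for i in rng.choice(M, size=k, replace=False)]
    det_Y = np.linalg.det(L[np.ix_(Y, Y)])
    for _ in range(n_steps):
        pos = int(rng.integers(k))                 # element of Y to remove
        outside = [j for j in range(M) if j not in Y]
        j = outside[int(rng.integers(len(outside)))]  # element to add
        proposal = Y.copy()
        proposal[pos] = j
        det_new = np.linalg.det(L[np.ix_(proposal, proposal)])
        # Assumed acceptance rule; always accept if the current set is degenerate.
        accept = 1.0 if det_Y <= 0 else min(1.0, det_new / det_Y)
        if rng.random() < accept:
            Y, det_Y = proposal, det_new
    return sorted(Y)
```

The chain keeps the subset size fixed at k throughout, since every move exchanges one selected item for one unselected item.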
We provide the wall-clock time of the above algorithms for real-world datasets in tab:mapinferencetime. Observe that the greedy algorithm is the fastest method for all datasets except Instacart. For Instacart, MCMC sampling is faster than the other approaches, but it has much larger relative errors in terms of log-determinant (see tab:mapinference), which makes it unsuitable for our purposes.
Algorithms | | | | |
---|---|---|---|---|
Greedy local search [22] | 4.28 ms | 7.11 ms | 36.92 ms | 468.50 ms |
Greedy (alg:greedylowrank) | 0.19 ms | 0.27 ms | 1.88 ms | 21.86 ms |
Stochastic greedy [36] | 0.26 ms | 0.38 ms | 2.03 ms | 24.40 ms |
MCMC sampling [29] | 6.30 ms | 6.44 ms | 10.88 ms | 16.86 ms |
Appendix E Corollary of thm:greedy
thm:greedy requires the technical condition but in practice there is no particular evidence that this condition holds. While this condition can be achieved by multiplying by a constant, this leads to a (potentially large) additive term in eq:greedyratio. Here, we provide cor:greedy which excludes the assumption from thm:greedy, and quantifies this additive term.
Corollary 1.
Consider a nonsymmetric low-rank DPP , where are of rank , and . Given a cardinality budget , let and denote the smallest and largest singular values of for all and . Let . Then,
(12) |
where is the output of alg:greedylowrank and is the optimal solution of MAP inference in eq:mapinference.
The proof of cor:greedy is provided in sec:proof-corr-greedy. Note that instead of , cor:greedy has a term in the denominator.
Appendix F Proofs
F.1 Proof of lemma:skew-sym-decomp
See Lemma 1.
Proof.
Let be the submatrix of obtained by keeping its first rows and columns and let , where is the identity matrix. Then, and one can write . Setting proves the lemma. ∎
F.2 Proof of theorem:loglik-complexity
See Theorem 1.
Proof.
To show that this log-likelihood can be computed in time linear in , we first show that the DPP normalization term can be computed in linear time. We briefly note here that a key component in the proof of the normalization term complexity is the use of the Woodbury matrix identity, which allows us to transform an operation into an one. A similar approach was also recently adopted in normalizing flows for deep generative models [31].
First, assume that is invertible. Otherwise, replace with for some small . Then we have:
(13) | |||
(14) | |||
(15) | |||
(16) |
eq:norm-com-proof-0 follows from application of the generalized form of the matrix determinant lemma. eq:norm-com-proof-1 follows from application of the Weinstein–Aronszajn identity. eq:norm-com-proof-2 follows from application of Woodbury’s matrix identity. If we analyze the final form in eq:norm-com-proof-3, we see that it consists of three determinants of matrices. Thus, once the internal matrices are computed, the cost of the determinants is . The most expensive internal matrix computation is . Computing takes operations. Computing the inverse takes operations, and the remaining multiplications by require operations. Therefore, computing requires operations overall.
Having established that the normalization term in the likelihood can be computed in time, we proceed with characterizing the complexity of the other terms in the likelihood. The first term in eq:nonsymm-full-log-likelihood consists of determinants of size . Assuming that these never exceed size , each can be computed in at most time. The regularization term is a simple sum of norms that can be computed in time. Therefore, the full regularized log-likelihood can be computed in time. ∎
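The chain of identities used in this proof can be checked numerically. The sketch below assumes, for illustration only, a kernel of the form $L = VV^\top + BCB^\top$ with $V, B \in \mathbb{R}^{M \times K}$ and an invertible $K \times K$ matrix $C$. The symbols, the function name, and this particular factorization are assumptions of the example and may not match the paper's notation exactly. Under that assumption, the generalized matrix determinant lemma, the Weinstein–Aronszajn identity, and Woodbury's identity together give $\det(I + L) = \det(I_K + V^\top V)\,\det(C)\,\det\!\left(C^{-1} + B^\top B - B^\top V (I_K + V^\top V)^{-1} V^\top B\right)$, which involves only $K \times K$ determinants.

```python
import numpy as np


def log_normalizer_lowrank(V, B, C):
    """Sign and log|det(I + V V^T + B C B^T)| using only K x K determinants.

    V, B are M x K and C is an invertible K x K matrix (assumed form).
    """
    K = V.shape[1]
    A = np.eye(K) + V.T @ V              # Weinstein-Aronszajn: det(I_M + V V^T) = det(A)
    BtV = B.T @ V                        # K x K product, linear in M
    # Woodbury: B^T (I_M + V V^T)^{-1} B = B^T B - B^T V A^{-1} V^T B
    BtXinvB = B.T @ B - BtV @ np.linalg.solve(A, BtV.T)
    inner = np.linalg.inv(C) + BtXinvB   # generalized matrix determinant lemma
    signs, logs = zip(*(np.linalg.slogdet(m) for m in (A, C, inner)))
    return float(np.prod(signs)), float(np.sum(logs))
```

Every product that touches the $M$ dimension costs $O(MK^2)$ in this sketch, and everything else operates on $K \times K$ matrices, so no $M \times M$ matrix is ever formed.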
F.3 Proof of theorem:grad-complexity
See Theorem 2.
Proof.
To prove that the gradient of the log-likelihood can be computed in time linear in , we begin by showing that the gradient of the DPP’s normalization term can be computed in linear time.
From eq:norm-com-proof-1 in theorem:loglik-complexity’s proof, we have:
(17) |
The gradient of has three parts: . We derive each below, making use of the standard rule for the gradient of : for any matrix , . If we define , the gradients are:
(18) | ||||
(19) | ||||