Scalable Learning and MAP Inference for Nonsymmetric Determinantal Point Processes

06/17/2020 ∙ by Mike Gartrell, et al. ∙ Google Criteo KAIST 수리과학과 0

Determinantal point processes (DPPs) have attracted significant attention from the machine learning community for their ability to model subsets drawn from a large collection of items. Recent work shows that nonsymmetric DPP kernels have significant advantages over symmetric kernels in terms of modeling power and predictive performance. However, the nonsymmetric kernel learning algorithm from prior work has computational complexity that is cubic in the size of the DPP ground set, from which subsets are drawn, making it impractical to use at large scales. In this work, we propose a new decomposition for nonsymmetric DPP kernels that induces linear-time complexity for learning and approximate maximum a posteriori (MAP) inference. We also prove a lower bound on the quality of this MAP approximation. Through evaluation on real-world datasets, we show that our new decomposition not only scales better, but also matches or exceeds the predictive performance of prior work.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Determinantal point processes (DPPs) have proven useful in numerous machine learning applications. For example, recent uses include summarization Sharghi et al. (2018), recommender systems Wilhelm et al. (2018)

, neural network compression

Mariet and Sra (2016), kernel approximation Li et al. (2016), multi-modal output generation Elfeki et al. (2019), and batch selection, both for stochastic optimization Zhang et al. (2017)

and for active learning

Bıyık et al. (2019). DPPs have been applied in these cases because for each there existed a subset-selection problem, and a good solution to this problem was to select items that were high-quality but also diverse; DPPs provide a means of trading off quality with diversity in a principled way Kulesza et al. (2012).

For subset selection problems where the ground set of items to select from has cardinality , the typical DPP is parameterized by an kernel matrix. Most prior work has been concerned with symmetric DPPs, where the kernel must equal its transpose. However, recent work has considered the more general class of nonsymmetric DPPs (NDPPs) and shown that these have additional useful modeling power Brunel (2018); Gartrell et al. (2019). In particular, NDPPs allow modeling of positive correlations between items, where the presence of item

in the selected set increases the probability that some other item

will also be selected. Symmetric DPPs cannot capture such positive correlations. There are many intuitive examples of how positive correlations can be of practical importance. For example, consider a product recommendation task for a website, where a user has a camera in her shopping cart, and the goal is to display several other items that she might purchase. Relative to an empty cart, the presence of the camera probably increases the probability that the person will buy an accessory like a tripod. Although NDPPs can model such behavior, scalable approaches for NDPP learning and MAP inference have not been studied.

Here, we build on prior NDPP work and make the following contributions:

Learning: We propose a decomposition for NDPP kernels that reduces the complexity of learning from cubic Gartrell et al. (2019) to linear in , the size of the DPP ground set.

MAP inference: For the problem of finding the highest-probability subset under a DPP (the “MAP inference” problem), we analyze a standard greedy subset selection algorithm and show that, for low-rank NDPPs, it can be run in time linear in the size of the DPP ground set. We also describe how the same algorithm can be used to do conditioning for next-item prediction tasks. We then prove a lower bound on greedy MAP performance for all NDPPs.

We combine the above contributions through experiments that involve learning NDPP kernels and applying MAP inference to these kernels for several real-world datasets. These experiments demonstrate that our scalable decomposition matches or exceeds the predictive performance of prior work. We also show that, for small synthetic kernels generated from learned kernels, the proposed greedy MAP algorithm has an approximation quality that often substantially exceeds the lower bound, with near-optimal performance in many cases.

2 Background

Consider a finite set of cardinality , which we will also denote by . A DPP on

defines a probability distribution over all its

subsets. It is parameterized by a matrix , called the kernel, such that the probability of each subset is proportional to the determinant of its corresponding principal submatrix: . The normalization constant for this distribution can be expressed as a single determinant: (Kulesza et al., 2012, Theorem 2.1). Hence, . We will use to denote this distribution.

For intuition about the kernel parameters, notice that the probabilities of singletons and are proportional to and , respectively. Hence, it is common to think of ’s diagonal as representing item qualities. The probability of the set is proportional to . Thus, if , then and are positively correlated. Similarly, if , then and are negatively correlated. Therefore, off-diagonal terms determine item correlations.

In order to ensure that defines a probability distribution, all principal minors of must be non-negative: . Matrices that satisfy this property are called -matrices (Fang, 1989, Definition 1). There is no known generation method or matrix decomposition that fully covers the space of all matrices, although there are many that partially cover the space Tsatsomeros (2004).

One common partial solution is to use a decomposition that covers the space of symmetric matrices. By restricting to the space of symmetric matrices, one can exploit the fact that if and only if is positive semidefinite (PSD) Prussing (1986). (Recall that a matrix is defined to be PSD if and only if , for all

.) Any symmetric PSD matrix can be written as the Gramian matrix of some set of vectors:

, where . Hence, the decomposition provides an easy means of generating the entire space of symmetric matrices. It also has a nice intuitive interpretation, if we view the -th row of as a length- feature vector describing item .

Unfortunately, the symmetry requirement severely limits the types of correlations that a DPP can capture. A symmetric model is able to capture only nonpositive correlations between items, since , whereas a nonsymmetric can capture both negative and positive correlations (see (Gartrell et al., 2019, Section 2.1) for more intuition). To expand coverage to nonsymmetric matrices in , it’s natural to consider nonsymmetric PSD matrices. In what follows, we denote by the set of all nonsymmetric (and symmetric) PSD matrices. Any nonsymmetric PSD matrix is in  (Gartrell et al., 2019, Lemma 1), so . However, unlike in the symmetric case, the set of nonsymmetric PSD matrices does not fully cover the set of nonsymmetric matrices. For example, consider

Still, nonsymmetric PSD matrices cover a large enough portion of the space to be useful in practice, as evidenced by the experiments of Gartrell et al. (2019). This work covered the space by using the following decomposition: , with for , and for . This decomposition makes use of the fact that any matrix can be decomposed uniquely as the sum of a symmetric matrix

and a skew-symmetric matrix

. (To see this, let and .) All skew-symmetric matrices are trivially PSD, since for all . Hence, the here is guaranteed to be PSD simply because its uses the standard Gramian decomposition .

In this work we will also only consider , and leave to future work the problem of finding tractable ways to cover the rest of . We propose a new decomposition of that allows for more scalable learning. As in prior DPP work, our decomposition has inner dimension that could be as large as , but is usually much smaller in practice. Our algorithms work well for modest values of . In cases where the natural

is larger (e.g., natural language processing applications where

might be the number of words), random projections can often be used to significantly reduce  Gillenwater et al. (2012a).

3 New kernel decomposition and scalable learning

Prior work on NDPPs proposed a maximum likelihood estimation (MLE) learning algorithm 

Gartrell et al. (2019). Due to that work’s particular kernel decomposition, this learning approach has complexity cubic in the number of items . Here, we propose an alternative decomposition that reduces the complexity to linear in .

We begin by showing that our new decomposition covers the space of matrices. Before diving in, let us define as shorthand for a block matrix with zeros on-diagonal and opposite values off-diagonal. Then, our proposed decomposition is as follows:

(1)

where , and is a block-diagonal matrix with some diagonal blocks of the form , with , and zeros elsewhere. The following lemma shows that this decomposition covers the space of matrices.

Lemma 1.

For , let be an even integer and let be a skew-symmetric matrix with rank . Then, there exist and positive numbers , such that , where is the block-diagonal matrix with diagonal blocks of size 2 given by , .

The proof of lemma:skew-sym-decomp and all subsequent results can be found in sec:proof. With this decomposition in hand, we now proceed to show that it can be used for linear-time MLE learning. To do so, we must show that corresponding NDPP log-likelihood objective and gradient can be computed in time linear in . Given a collection of observed subsets composed of items from , the full formulation of the regularized log-likelihood is:

(2)

where denotes a matrix composed of the rows of that correspond to the items in . The regularization term, , is defined as follows:

(3)

where counts the number of occurrences of item in the training set, and are rows of and , respectively, and

are tunable hyperparameters. This regularization is similar to that of prior work 

Gartrell et al. (2017, 2019). We omit regularization for since regularization based on item counts (popularity) cannot be directly applied to . Furthermore, we observe from our experiments that regularization on does not seem to be needed in practice.

theorem:loglik-complexity,theorem:grad-complexity show that computing the regularized log-likelihood and its gradient both have time complexity linear in . The complexities also depend on , the rank of the NDPP, and , the size of the largest observed subset in the data. In practice though, usually and . Hence, linearity in means that we can efficiently perform learning for datasets with very large ground sets, which is impossible with the cubic-complexity decomposition in prior work Gartrell et al. (2019). We briefly note here that a key component in the proof of theorem:loglik-complexity is use of the Woodbury matrix identity. This allows us to transform an operation into an one, by replacing the computation of a size- with matrix multiplications and a size- . A similar approach was also recently adopted in work on normalizing flows for deep generative models Lu and Huang (2020).

Theorem 1.

Given a nonsymmetric low-rank DPP parameterized by of rank , of rank , and a matrix , where , we can compute the regularized log-likelihood, eq:nonsymm-full-log-likelihood, in time, where is the size of the largest of the training subsets.

Theorem 2.

Given a nonsymmetric low-rank DPP parameterized by of rank , of rank , and a matrix , where , we can compute the gradient of the regularized log-likelihood in eq:nonsymm-full-log-likelihood in time, where is the size of the largest of the training subsets.

To further simplify learning and MAP inference, we set , which results in . This change also simplifies regularization, so that we only perform regularization on , as indicated in the first term of eq:regularization, leaving us with the single regularization hyperparameter of . While setting restricts the class of nonsymmetric kernels that can be represented, we find in practice that this does not adversely impact prediction quality.

To compensate for the restriction imposed by setting , we also relax the block-diagonal structure imposed on , so that we learn a full skew-symmetric matrix . We empirically observe that this relaxation of the block-diagonal structure is needed to ensure learning of nonsymmetric kernels with better predictive performance. To ensure that and thus is skew-symmetric, we set , where .

4 MAP inference

After learning an NDPP, one can then use it to infer the most probable item subsets in various situations. Several inference algorithms have been well-studied for symmetric DPPs, including sampling Kulesza and Taskar (2011); Anari et al. (2016); Li et al. (2016); Launay et al. (2018); Gillenwater et al. (2019); Poulson (2019); Dereziński (2019) and MAP inference Gillenwater et al. (2012b); Han et al. (2017); Chen et al. (2018); Han and Gillenwater (2020). We focus on MAP inference:

(4)

for cardinality budget . MAP inference for DPPs is known to be NP-hard even in the symmetric case Ko et al. (1995); Kulesza et al. (2012). For symmetric DPPs, one usually approximates the MAP via the standard greedy algorithm for submodular maximization Nemhauser et al. (1978). First, we describe how to efficiently implement this for NDPPs. Then, in sec:map-approx-guarantee we prove a lower bound on its approximation quality. To the best of our knowledge, this is the first investigation of how to apply the greedy algorithm to NDPPs.

Greedy begins with an empty set and repeatedly adds the item that maximizes the marginal gain until the chosen set is size . Here, we aim to design an efficient greedy algorithm when the NDPP kernel is given by a low-rank matrix. For generality, in what follows we write the kernel as , since one can easily rewrite our matrix decomposition, as well as that of Gartrell et al. (2019), to take this form. For example, for our decomposition: . Using Schur’s determinant identity, we first observe that, for and , the marginal gain of a NDPP can be written as

(5)

where and . A naïve computation of eq:bilinear-marginalgain is , since we must invert a matrix, where . However, one can compute eq:bilinear-marginalgain more efficiently by observing that it actually can be expressed as a rank- matrix and hence computed in time.

Lemma 2.

Given , , and , let be the -th row in and be a matrix containing rows in indexed by . Then, it holds that

(6)

where row vectors for satisfy , , and

(7)

Plugging eq:bilinear-inverse into eq:bilinear-marginalgain, the marginal gain with respect to can be computed by simply updating from the previous gain with respect to . That is,

(8)
(9)

The marginal gains when are equal to diagonals of and require operations. Then, computing the update terms in eq:marginalgainupdate for all needs operations. Since the total number of updates is , the overall complexity becomes . We provide a full description of the implied greedy algorithm for low-rank NDPPs in alg:greedylowrank.

1:Input: , , the cardinality And for conditioning
2:initialize , and
3: for where is the -th row in
4: and for conditioning
5:while  do
6:     
7:     
8:      and
9:      for
10:      and for conditioning
11:end while
12:return return for conditioning
Algorithm 1 Greedy MAP inference for low-rank NDPPs
Low-rank DPP Models MLE Learning MAP Inference
Symmetric DPP Gartrell et al. (2017)
Nonsymmetric DPP Gartrell et al. (2019)
Scalable nonsymmetric DPP (this work)
Table 1: Runtime complexities for several DPP models. Our model and the symmetric DPP model Gartrell et al. (2017) can perform both tasks in time linear in the size of ground set , but ours is a more general model that can capture positive as well as negative item correlations.

tab:runtimes summarizes the runtime complexitiy of our methods and those of previous work. We also note that memory required for MAP inference is in all cases. For learning, the memory required is for the NDPPs from prior work Gartrell et al. (2019), but only for low-rank symmetric DPPs Gartrell et al. (2017) and our proposed scalable NDPPs.

4.1 Approximation guarantee for greedy NDPP MAP inference

As mentioned above, alg:greedylowrank is an instantiation of the standard greedy algorithm used for submodular maximization Nemhauser et al. (1978). This algorithm has a -approximation guarantee for the problem of maximizing nonnegative, monotone submodular functions. While the function is submodular for a symmetric PSD  Kelmans and Kimelfeld (1983), it is not monotone. Often, as in Han and Gillenwater (2020)

, it is assumed that the smallest eigenvalue of

is greater than , which guarantees montonicity. There is no particular evidence that this assumption is true for typical practical models, but nevertheless the greedy algorithm tends to perform well in practice for symmetric DPPs. Here, we prove a similar approximation guarantee that covers NDPPs as well, even though the function is non-submodular when is nonsymmetric. In sec:mapinference, we further observe that, as for symmetric DPPs, the greedy algorithm seems to work well in practice for NDPPs.

Recently, Bian et al. (2017) proposed an extension of greedy algorithm guarantees to non-submodular functions. Their result is based on the submodularity ratio and curvature of the objective function, which measure to what extent the objective has submodular properties. Leveraging this result, in thm:greedy we provide an approximation ratio for greedy MAP inference of NDPPs.

Theorem 3.

Consider a nonsymmetric low-rank DPP , where are of rank , and . Given a cardinality budget , let and

denote the smallest and largest singular values of

for all and . Assume that . Then,

(10)

where is the output of alg:greedylowrank and is the optimal solution of MAP inference in eq:mapinference.

Thus, when the kernel has a small value of , the greedy algorithm finds a near-optimal solution. In practice, we observe that the greedy algorithm finds a near-optimal solution even for large values of this ratio (see sec:mapinference). As remarked above, there is no evidence that the condition is usually true in practice. While this condition can be achieved by multiplying by a constant, this leads to a (potentially large) additive term in eq:greedyratio. We provide cor:greedy in sec:corr-greedy, which excludes the assumption, and quantifies this additive term.

4.2 Greedy conditioning for next-item prediction

We briefly describe here a small modification to the greedy algorithm that is necessary if one wants to use it as a tool for next-item prediction. Given a set , Kulesza et al. (2012) showed that a DPP with conditioned on the inclusion of the items in the set forms another DPP with kernel where . The singleton probability can be useful for doing next-item prediction. We can use the same machinery from the greedy algorithm’s marginal gain computations to effectively compute these singletons. More concretely, suppose that we are doing next-item prediction as a shopper adds items to a digital cart. We predict the item that maximizes the marginal gain, conditioned on the current cart contents (the set in the greedy algorithm). When the shopper adds the next item to their cart, we update to include this item, rather than our predicted item (line 10 in alg:greedylowrank). We then iterate until the shopper checks out. The comments on the righthand side of alg:greedylowrank summarize this procedure. The runtime of this prediction is the same that of the greedy algorithm, . We note that this cost is comparable to that of an approach based on the DPP dual kernel from prior work Mariet et al. (2019), which has complexity. However, since it is non-trivial to define the dual kernel for NDPPs, the greedy algorithm may be the simpler choice for next-item prediction for NDPPs.

5 Experiments

5.1 Datasets

We perform experiments on several real-world public datasets composed of online shopping baskets:

  1. Amazon Baby Registries: This dataset consists of registries or "baskets" of baby products, and has been used in prior work on DPP learning Gartrell et al. (2016, 2019); Gillenwater et al. (2014); Mariet and Sra (2015). The registries contain items from 15 different categories, such as “apparel”, with a catalog of up to 100 items per category. Our evaluation mirrors that of Gartrell et al. (2019); we evaluate on the popular apparel category, which contains 14,970 registries, as well as on a dataset composed of the three most popular categories: apparel, diaper, and feeding, which contains a total of 31,218 registries.

  2. UK Retail: This dataset Chen et al. (2012) contains baskets representing transactions from an online retail company that sells unique all-occasion gifts. We omit baskets with more than 100 items, leaving us with a dataset containing 19,762 baskets drawn from a catalog of 3,941 products. Baskets containing more than 100 items are in the long tail of the basket-size distribution of the data, so omitting larger baskets is reasonable, and allows us to use a low-rank factorization of the DPP with .

  3. Instacart: This dataset Instacart (2017) contains baskets purchased by Instacart users. Again, we omit baskets with more than 100 items, leaving us with 3.2 million baskets and a catalog of 49,677 products.

5.2 Experimental setup and metrics

We use a small held-out validation set, consisting of 100 randomly-selected baskets, for tracking convergence during training and for tuning hyperparameters. A random selection of 2000 of the remaining baskets are used for testing, and the rest are used for training. Convergence is reached during training when the relative change in validation log-likelihood is below a predetermined threshold. We use PyTorch with Adam 

Kingma and Ba (2015) for optimization.

Subset expansion task. We use greedy conditioning to do next-item prediction, as described in subsec:greedy-conditioning. We compare methods using a standard recommender system metric: mean percentile rank (MPR) Hu et al. (2008); Li et al. (2010). MPR of 50 is equivalent to random selection; MPR of 100 means that the model perfectly predicts the next item. See sec:MPR for a complete description of the MPR metric.

Subset discrimination task. We also test the ability of a model to discriminate observed subsets from randomly generated ones. For each subset in the test set, we generate a subset of the same length by drawing items uniformly at random (and we ensure that the same item is not drawn more than once for a subset). We compute the AUC for the model on these observed and random subsets, where the score for each subset is the log-likelihood that the model assigns the subset.

Amazon: Apparel ()
Amazon: 3-category ()
Metric Sym Nonsym Scalable nonsym Sym Nonsym Scalable nonsym
MPR 77.42 1.12 80.32 0.75 84.86 1.51 60.61 0.94 75.09 0.85 75.08 1.64
AUC 0.66 0.01 0.73 0.01 0.71 0.03 0.70 0.01 0.79 0.01 0.80 0.04
test log-likelihood -10.31 -9.66 -9.64 -18.11 -17.02 -17.22
UK Retail ()
Instacart ()
Metric Sym Nonsym Scalable nonsym Sym Nonsym Scalable nonsym
MPR 76.79 0.60 79.45 0.57 81.64 1.22 91.01 1.22 - 92.90 0.49
AUC 0.57 0.001 0.65 0.01 0.76 0.02 0.83 0.01 - 0.84 0.01
test log-likelihood -120.47 -108.67 -106 -70.81 - -69.12
Table 2: Average MPR, AUC, and test log-likelihood for all datasets, for the low-rank symmetric DPP Gartrell et al. (2017), low-rank NDPP Gartrell et al. (2019)

, and our scalable NDPP models. MPR and AUC results show 95% confidence estimates obtained via bootstrapping. Bold values indicate improvement over the symmetric low-rank DPP outside of the confidence interval. See sec:hyperparams for the hyperparameter settings used in these experiments. The baseline NDPP model cannot be feasibly trained on Instacart, as memory and computational costs are prohibitive due to its large ground set size.

5.3 Predictive performance results for learning

Since the focus of our work is on improving NDPP scalability, we use the low-rank symmetric DPP Gartrell et al. (2017) and the low-rank NDPP of prior work Gartrell et al. (2019) as baselines for our experiments. tab:predictive-qual compares these DPP approaches and our scalable low-rank NDPP. We see that our scalable NDPP matches or exceeds the predictive quality of the baseline NDPP. Notice that our scalable NDPP opens to the door to training on datasets with large , such as the Instacart dataset, which is infeasible for the baseline NDPP due to very high memory and compute costs. As expected, we also empirically observe that the scalable NDPP trains far faster than the NDPP for datasets with large ground sets. For example, the per-iteration gradient update of scalable NDPP is faster than that of the decomposition of Gartrell et al. (2019) on the UK dataset. See sec:training-time for a comparison of overall training times.

5.4 Performance results for MAP inference

We run various approximatation algorithms for MAP inference, including the greedy algorithm (alg:greedylowrank), stochastic greedy algorithm Mirzasoleiman et al. (2015), MCMC-based DPP sampling Li et al. (2016), and greedy local search Kathuria and Deshpande (2016). The stochastic greedy algorithm computes marginal gains of a few items chosen uniformly at random and selects the best among them. The MCMC sampling begins with a random subset of size and picks and uniformly at random. Then, it swaps them with probability and iterates this process. The greedy local search algorithm Kathuria and Deshpande (2016) starts from the output from the greedy algorithm, , and replaces with that gives the maximum improvement, if such exist. This replacement process iterates until no improvement exists, or at most steps have been completed, to guarantee a tight approximation Kathuria and Deshpande (2016). We use greedy local search as a baseline since it always returns a better solution than greedy. However, it is the slowest among all algorithms, as its time complexity is . We choose , and provide more details of all algorithms in sec:benchmark.

To evaluate the performance of MAP inference, we report the relative log-determinant ratio, defined as where is the output of benchmark algorithms and is the greedy local search result. Results are reported in tab:mapinference. We observe that the greedy algorithm achieves performance close to that of the significantly more expensive greedy local search algorithm, with relative errors of up to . Stochastic greedy and MCMC sampling have significantly larger errors.

Algorithms
Amazon: Apparel
Amazon: 3-category
UK Retail
Instacart
Greedy (alg:greedylowrank) 0.0196 0.0020 0.0613 0.0027 0.0498 0.0017 0.0085 0.0005
Stochastic greedy Mirzasoleiman et al. (2015) 0.1296 0.0041 0.1716 0.0041 0.1526 0.0028 0.1408 0.0044
MCMC sampling Li et al. (2016) 0.5437 0.0083 0.7640 0.0092 0.8930 0.0085 3.2646 0.0401
Table 3: Average relative error and 95% confidence intervals of MAP inference algorithms on NDPPs learned from real-world datasets. For all datasets, we evaluate kernels learned with different initializations, and run random trials for stochastic greedy and MCMC sampling. All errors are relative to greedy local search.

5.5 Performance guarantee for greedy MAP inference

The matrices learned on real datasets are too large to compute the exact MAP solution, but we can compute exact MAP for small matrices. In this section, we explore the performance of the greedy algorithm studied in thm:greedy for synthetic kernel matrices. More formally, we first pick singular values from a kernel learned for the “Amazon: 3-category” dataset (a plot of these singular values can be seen in fig:approxratio:singular) and generate , where are random orthonormal matrices. To ensure that is a matrix, we repeatedly sample until all principal minors of are nonnegative. We also evaluate the performance of the symmetric DPP, where the kernel matrices are generated similarly to the NDPP, except we set . We set and generate random kernels for both symmetric DPPs and NDPPs.

The results for symmetric and nonsymmetric DPPs are shown in fig:approxratio:sym and fig:approxratio:nonsym, respectively. We plot the approximation ratio of alg:greedylowrank, i.e., , with respect to , from cor:greedy. We observe that the greedy algorithm for both often shows approximation ratios close to . However, the worst-case ratio for NDPPs is worse than that of symmetric DPPs; for is non-submodular, and the greedy algorithm with a nonsubmodular function does not have as tight of a worst-case bound as in the symmetric case.

(a) Symmetric DPP
(b) Nonsymmetric DPP
(c) Singular values
Figure 1: Approximation ratios of greedy with respect to different values of from cor:greedy under (a) symmetric DPP and (b) nonsymmetric DPP. (c) The singular values of the kernels learned for the “Amazon: 3-category” dataset. We construct random matrices , with rank , whose singular values are from the learned kernels.

6 Conclusion

We have presented a new decomposition for nonsymmetric DPP kernels that can be learned in time linear in the size of the ground set, which is a significant improvement over the cubic complexity of prior work. Empirical results indicate that this decomposition matches or exceeds the predictive performance of the prior decomposition. We have also proved a lower bound on the quality of the greedy MAP approximation algorithm for nonsymmetric DPPs, and showed how to implement it efficiently. For future work, we will investigate tightening the performance bound for the greedy MAP algorithm, work on developing intuition about the meaning of the parameters in the matrix, and consider kernel decompositions that cover other parts of the nonsymmetric space.

Broader Impact

In general, we feel that our work moves in a positive direction by decreasing the storage and computation costs of learning NDPPs. However, in terms of broader impact, concerns related to this work are similar to those of other recommender system work (Milano et al., 2020). When applying our methods to learn kernels from user data, we recommend employing a technique such as differentially-private SGD (Abadi et al., 2016) to help prevent user data leaks, and adjusting the weights on training examples to balance the impact of sub-groups of users so as to make the final kernel as fair as possible.

References

Appendix A Mean Percentile Rank

We begin our definition of MPR by defining percentile rank (PR). First, given a set , let . The percentile rank of an item given a set is defined as

where indicates those elements in the ground set that are not found in .

For our evaluation, given a test set , we select a random element and compute . We then average over the set of all test instances to compute the mean percentile rank (MPR):

Appendix B Hyperparameters for experiments in tab:predictive-qual

Preventing numerical instabilities: The first term on the right side of eq:nonsymm-full-log-likelihood will be singular whenever , where is an observed subset. Therefore, to address this in practice we set to the size of the largest subset observed in the data, , as in [13]. However, this does not entirely address the issue, as the first term on the right side of eq:nonsymm-full-log-likelihood may still be singular even when . In this case though, we know that we are not at a maximum, since the value of the objective function is . Numerically, to prevent such singularities, in our implementation we add a small correction to each when optimizing eq:nonsymm-full-log-likelihood (we set in our experiments).

We perform a grid search using a held-out validation set to select the best performing hyperparameters for each model and dataset. The hyperparameter settings used for each model and dataset are described below.

Symmetric low-rank DPP [12]. For this model, we use for the number of item feature dimensions for the symmetric component , and for the regularization hyperparameter for . We use the following hyperparameter settings:

  • Both Amazon datasets: .

  • UK Retail dataset: .

  • Instacart dataset: .

Baseline NDPP [11]. For this model, to ensure consistency with the notation used in [11], we use to denote the number of item feature dimensions for the symmetric component , and to denote the number of item feature dimensions for the nonsymmetric components, and . As described in [11], is the regularization hyperparameter for the , while and are the regularization hyperparameters for and , respectively. We use the following hyperparameter settings:

  • Both Amazon datasets: .

  • Amazon apparel dataset: .

  • Amazon three-category dataset: .

  • UK Retail dataset: .

  • All datasets: .

Scalable NDPP. As described in sec:model, we use to denote the number of item feature dimensions for the symmetric component and the dimensionality of the nonsymmetric component . is the regularization hyperparameter. We use the following hyperparameter settings:

  • Amazon apparel dataset: .

  • Amazon three-category dataset: .

  • UK dataset: .

  • Instacart dataset: .

For all of the above model configurations we use a batch size of 200 during training, except for the scalable NDPPs trained on the Amazon apparel and Instacart datasets, where a batch size of 400 is used.

Appendix C Training time

In fig:training-time, we report the wall-clock training time of the decomposition of [11] (NDPP) and our scalable NDPP for the Amazon: 3-category (fig:training-time:amazonthree) and UK Retail (fig:training-time:ukretail) datasets. For the Amazon: 3-category dataset, both approaches show comparable results, with the scalable NDPP converging times faster than NDPP. But for the UK Retail dataset, which has a much larger ground set, our scalable NDPP achieves convergence about times faster. We do not have a timing comparison for the Instacart dataset because the model with the decomposition of [11] cannot be trained on this dataset due to prohibitive memory and computational costs.

(a) Amazon: 3-category
(b) UK Retail
Figure 2: The negative log-likelihood of the training set for Gartrell et al. [11]’s NDPP (blue, dashed) and our scalable NDPP (red, solid) versus wall-clock time for the (a) Amazon: 3-category and (b) UK Retail datasets.

Appendix D Benchmark algorithms for MAP inference

We test following approximate algorithms for MAP inference:

Greedy local search.

This algorithm starts from the output of greedy, , and replaces with that gives the maximum improvement of the determinant, if such exist. Kathuria and Deshpande [22] showed that running the search for such a swap times with an accuracy parameter gives a tight approximation guarantee for MAP inference for symmetric DPPs. We set the number of swaps to for and use greedy local search as a baseline, since it is strictly an improvement on the greedy solution. However, greedy local search requires operations, and thus it is the slowest among all of our baseline algorithms.

Stochastic greedy.

This algorithm computes marginal gains of a few items chosen uniformly at random and selects the best among them. [36] proved that samples are enough to guarantee an approximation ratio for submodular functions (i.e., symmetric DPPs). We choose and set the number of samples to . Under this setting, the time complexity of stochastic greedy is , which is better than the naïve exact greedy algorithm. However, we note that it is worse than that of our efficient greedy implement (alg:greedylowrank). This is because the stochastic greedy uses different random samples for every iteration and this does not take advantage of the amortized computations in lmm:bilinear-inverse. In our experiments, we simply modify line 10 in alg:greedylowrank for stochastic greedy ( is operated on a random subset of marginal gains), hence it can run in time. In practice, we observe that stochastic greedy is slightly slower than exact greedy due to the additional costs of random sampling process.

MCMC sampling.

We also compare inference algorithms with sampling from a DPP with fixed size (known as a -DPP). Exact sampling [27] requires eigendecomposition of , which is infeasible for a large

. To resolve this, Markov Chain Monte-Carlo (MCMC) based sampling is preferred. In particular, we consider the Metropolis-Hastings algorithm, which begins with a random subset

with size , and picks and uniformly at random. Then, it swaps them with probability

(11)

and repeats this process for several steps. Recent work [29] shows that the MCMC sampling provides promising results for kernel approximation. We set the number of swaps to (the same as for greedy local search), for a runtime complexity of , which is better than the greedy algorithm.

We provide the wall-clock time of the above algorithms for real-world datasets in tab:mapinferencetime. Observe that the greedy algorithm is the fastest method for all datasets except Instacart. For Instacart, MCMC sampling is faster than other approaches, but it has much larger relative errors in terms of log-determinant (see tab:mapinference), which is not suitable for our purposes.

Algorithms
Amazon: Apparel
Amazon: 3-category
UK Retail
Instacart
Greedy local search [22] 4.28 ms 7.11 ms 36.92 ms 468.50 ms
Greedy (alg:greedylowrank) 0.19 ms 0.27 ms 1.88 ms 21.86 ms
Stochastic greedy [36] 0.26 ms 0.38 ms 2.03 ms 24.40 ms
MCMC sampling [29] 6.30 ms 6.44 ms 10.88 ms 16.86 ms
Table 4: Wall-clock time (in milliseconds) of MAP inference algorithms on NDPPs learned from real-world datasets.

Appendix E Corollary of thm:greedy

thm:greedy requires the technical condition but in practice there is no particular evidence that this condition holds. While this condition can be achieved by multiplying by a constant, this leads to a (potentially large) additive term in eq:greedyratio. Here, we provide cor:greedy which excludes the assumption from thm:greedy, and quantifies this additive term.

Corollary 1.

Consider a nonsymmetric low-rank DPP , where are of rank , and . Given a cardinality budget , let and denote the smallest and largest singular values of for all and . Let . Then,

(12)

where is the output of alg:greedylowrank and is the optimal solution of MAP inference in eq:mapinference.

The proof of cor:greedy is provided in sec:proof-corr-greedy. Note that instead of , cor:greedy has a term in the denominator.

Appendix F Proofs

f.1 Proof of lemma:skew-sym-decomp

See 1

Proof.

First,

for some orthogonal matrix

and

(see, e.g.,[41, Proposition 2.1], which is easily extended to the case when

is odd).

Let be the submatrix of obtained by keeping its first rows and columns and let , where is the identity matrix. Then, and one can write . Setting proves the lemma. ∎

f.2 Proof of theorem:loglik-complexity

See 1

Proof.

To show that this log-likelihood can be computed in time linear in , we first show that the DPP normalization term can be computed in linear time. We briefly note here that a key component in the proof of the normalization term complexity is use of the Woodbury matrix identity, which allows us to transform an operation into an one. A similar approach was also recently adopted in normalizing flow for deep generative models [31].

First, assume that is invertible. Otherwise, replace with for some small . Then we have:

(13)
(14)
(15)
(16)

eq:norm-com-proof-0 follows from application of the generalized form of the matrix determinant lemma. eq:norm-com-proof-1 follows from application of the Weinstein–Aronszajn identity. eq:norm-com-proof-2 follows from application of Woodbury’s matrix identity. If we analyze the final form in eq:norm-com-proof-3, we see that it consists of three determinants of matrices. Thus, once the internal matrices are computed, the cost of the determinants is . The most expensive internal matrix computation is . Computing takes operations. Computing the inverse takes operations, and the remaining multiplications by require operations. Therefore, computing requires operations overall.

Having established that the normalization term in the likelihood can be computed in time, we proceed with characterizing the complexity of the other terms in the likelihood. The first term in eq:nonsymm-full-log-likelihood consists of determinants of size . Assuming that these never exceed size , each can be computed in at most time. The regularization term is a simple sum of norms that can be computed in time. Therefore, the full regularized log-likelihood can be computed in time. ∎

f.3 Proof of theorem:grad-complexity

See 2

Proof.

To prove that the gradient of the log-likelihood can be computed in time linear in , we begin by showing that the gradient of the DPP’s normalization term can be computed in linear time.

From eq:norm-com-proof-1 in theorem:loglik-complexity’s proof, we have:

(17)

The gradient of has three parts: . We derive each below, making use of the standard rule for the gradient of : for any matrix , . If we define , the gradients are:

(18)
(19)