1 Introduction
Subset selection problems arise in a number of applications, including recommendation [9, 16, 21] and Web search [15]. In these domains, we are concerned with selecting a good subset of high-quality items that are distinct. For example, a recommended subset of products presented to a user should have high predicted ratings for that user while also being diverse, so that we increase the chance of capturing the user's interest with at least one of the recommended products. Determinantal point processes (DPPs) offer an attractive model for such tasks, since they jointly model set diversity and item quality or popularity, while offering a compact parameterization and efficient algorithms for performing inference. A distribution over sets that encourages diversity is of particular interest when recommendations are complementary; for example, when a shopping basket contains a laptop and a carrier bag, a complementary addition to the basket would typically be a laptop cover, rather than another laptop.
DPPs can be parameterized by an $N \times N$ positive semidefinite matrix $L$, where $N$ is the size of the item catalog. There has been some work focused on learning DPPs from observed data consisting of example subsets [1, 9, 16, 24], which is a challenging learning task that is conjectured to be NP-hard [17]. The most recent of this work has involved learning a nonparametric full-rank matrix $L$ [9, 24] that is not constrained to take a particular parametric form, which becomes problematic with large item catalogs, as we will see in this paper. In contrast, we present a method for learning a low-rank factorization of $L$, which scales much better than full-rank approaches and in some cases provides better predictive performance. The scalability improvements allow us to train our model on large datasets that are infeasible with a full-rank DPP, while also opening the door to computing online recommendations as required for many real-world applications.
In addition to the applications mentioned above, DPPs have been used for a variety of machine learning tasks
[12, 14, 17, 18, 29]. We focus on the recommendation task of "basket completion" in this work, where we compute predictions for the next item that should be added to a shopping basket, given a set of items already present in the basket. This task is at the heart of online retail experiences, such as the Microsoft Store (www.microsoftstore.com). Our work makes the following contributions:


We present a low-rank DPP model, including algorithms for learning from observed data and for computing predictions in the basket-completion scenario.

We perform a detailed experimental evaluation of our model on several real-world datasets, and show that our approach scales substantially better than a full-rank DPP model, while providing equivalent or better predictive performance than the full-rank model. We attribute our improvements in predictive performance to the novel use of regularization in our model.

In addition to comparing our approach to a full-rank DPP, we also compare to several other models for basket completion and show significant improvements in predictive performance in many cases.
2 Model
2.1 Background
A DPP is a distribution over configurations of points.^{2} ^{2}DPPs originated in statistical mechanics [23], where they were used to model distributions of fermions. Fermions are particles that obey the Pauli exclusion principle, which states that no two fermions can occupy the same quantum state. As a result, systems of fermions exhibit a repulsion or "anti-bunching" effect, which is described by a DPP. This repulsive behavior is a key characteristic of DPPs, which makes them a capable model for diversity. In this paper we deal only with discrete DPPs, which describe a distribution over subsets of a discrete ground set of items $\mathcal{Y} = \{1, 2, \ldots, N\}$, which we also call the item catalog. A discrete DPP on $\mathcal{Y}$ is a probability measure $\mathcal{P}$ on $2^{\mathcal{Y}}$ (the power set, or set of all subsets of $\mathcal{Y}$), such that for any $A \subseteq \mathcal{Y}$, the probability $\mathcal{P}(A)$ is specified by $\mathcal{P}(A) \propto \det(L_A)$. In the context of basket completion, $\mathcal{Y}$ is the item catalog (inventory of items on sale), and $A$ is the subset of items in a user's basket; there are $2^N$ possible baskets. The notation $L_A$ denotes the principal submatrix of the DPP kernel $L$ indexed by the items in $A$. Intuitively, the diagonal entry $L_{ii}$ of the kernel matrix captures the importance or quality of item $i$, while the off-diagonal entry $L_{ij} = L_{ji}$ measures the similarity between items $i$ and $j$. The normalization constant for $\mathcal{P}$ follows from the observation that $\sum_{A \subseteq \mathcal{Y}} \det(L_A) = \det(L + I)$, where $I$ is the $N \times N$ identity matrix. The value $\det(L_A)$ associates a "volume" to basket $A$, and its probability is normalized by the "volumes" of all possible baskets $A' \subseteq \mathcal{Y}$. Therefore, we have
$\mathcal{P}(A) = \dfrac{\det(L_A)}{\det(L + I)}. \qquad (1)$
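To make the normalization in (1) concrete, the following small sketch (our own illustration, not the paper's code) builds a toy 3-item kernel and checks that $\det(L + I)$ really sums the subset "volumes":

```python
import numpy as np
from itertools import chain, combinations

def dpp_prob(L, A):
    """Probability that a DPP with kernel L draws exactly the subset A, Eq. (1)."""
    L_A = L[np.ix_(A, A)]                      # principal submatrix for the basket
    Z = np.linalg.det(L + np.eye(L.shape[0]))  # normalizer: det(L + I)
    return np.linalg.det(L_A) / Z

# Toy 3-item catalog with K = 2 latent trait dimensions: L = V V^T
V = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.7, 0.7]])
L = V @ V.T

# The probabilities of all 2^3 subsets sum to 1, confirming the normalizer
subsets = chain.from_iterable(combinations(range(3), r) for r in range(4))
total = sum(dpp_prob(L, list(A)) for A in subsets)
print(round(total, 6))  # 1.0
```

Note how the diverse pair of items 0 and 1 (orthogonal trait vectors) receives a higher probability than the similar pair 0 and 2, illustrating the repulsion property.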
We use a low-rank factorization of the $N \times N$ matrix $L$,

$L = V V^T, \qquad (2)$

for the $N \times K$ matrix $V$, where $N$ is the number of items in the item catalog and $K$ is the number of latent trait dimensions. As we shall see in this paper, this low-rank factorization of $L$ leads to significant efficiency improvements, compared to a model that uses a full-rank matrix, when it comes to model learning and computing predictions. It also places an implicit constraint on the space of subsets of $\mathcal{Y}$: the model is restricted to place zero probability mass on subsets with more than $K$ items, since all eigenvalues of $L$ beyond the $K$th are zero. We see this from the observation that a sample from a DPP will not be larger than the rank of $L$ [8].

2.2 Learning
Our learning task is to fit a DPP kernel $L$ based on a collection of observed subsets $\mathcal{A} = \{A_1, \ldots, A_n\}$ composed of items from the item catalog $\mathcal{Y}$. These observed subsets constitute our training data, and our task is to maximize the likelihood for data samples drawn from the same distribution as $\mathcal{A}$. The log-likelihood for seeing $\mathcal{A}$ is
$f(V) = \sum_{n} \log \mathcal{P}(A_n) \qquad (3)$
$\phantom{f(V)} = \sum_{n} \log \det(L_{A_n}) - |\mathcal{A}| \log \det(L + I), \qquad (4)$
where $n$ indexes the observations or objects in $\mathcal{A}$. We call the log-likelihood function $f$, to avoid confusion with the matrix $L$. Recall from (2) that $L = VV^T$.
The next two subsections describe how we perform optimization and regularization for learning the DPP kernel.
2.3 Optimization Algorithm
We determine the matrix $V$ by gradient ascent. Therefore, we want to quickly compute the derivative $\partial f / \partial V$, which is an $N \times K$ matrix. For $i = 1, \ldots, N$ and $k = 1, \ldots, K$, we need a matrix of scalar derivatives $\partial f / \partial v_{ik}$, where $v_{ik}$ denotes entry $(i, k)$ of $V$.
Taking the derivative of each term of the log-likelihood, we have

$\dfrac{\partial f}{\partial v_{ik}} = \sum_n \dfrac{\partial \log \det(L_{A_n})}{\partial v_{ik}} - |\mathcal{A}| \, \dfrac{\partial \log \det(L + I)}{\partial v_{ik}}. \qquad (5)$
To compute the first term of the derivative, we see that

$\dfrac{\partial \log \det(L_{A_n})}{\partial v_{ik}} = \begin{cases} \left[ 2 \, L_{A_n}^{-1} V_{A_n} \right]_{ik} & \text{if } i \in A_n \\ 0 & \text{otherwise,} \end{cases} \qquad (6)$

where $V_{A_n}$ denotes the rows of the matrix $V$ indexed by the items in $A_n$, and the subscript $ik$ selects the entry corresponding to item $i$ and trait dimension $k$. Note that $L_{A_n} = V_{A_n} V_{A_n}^T$. Computing $L_{A_n}^{-1}$ is usually a relatively inexpensive operation, since the number of items in each training instance is generally small for many recommendation applications.
To compute the second term of the derivative, we see that

$\dfrac{\partial \log \det(L + I)}{\partial v_{ik}} = \left[ 2 \, V \left( V^T V + I_K \right)^{-1} \right]_{ik}, \qquad (7)$

where $I_K$ is the $K \times K$ identity matrix and we use the identity $\det(V V^T + I) = \det(V^T V + I_K)$. Computing $(V^T V + I_K)^{-1}$ is a relatively inexpensive operation, since we are inverting a $K \times K$ matrix with cost $O(K^3)$, and $K$ (the number of latent trait dimensions) is usually set to a small value.
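The two derivative terms can be combined and verified numerically. The sketch below (our own check, with a random low-rank kernel and toy baskets) compares the analytic gradient of Eqs. (5)-(7) against a finite-difference approximation:

```python
import numpy as np

def log_lik(V, baskets):
    """Low-rank DPP log-likelihood, Eq. (4)."""
    K = V.shape[1]
    ll = sum(np.linalg.slogdet(V[A] @ V[A].T)[1] for A in baskets)
    # det(V V^T + I_N) = det(V^T V + I_K), so the normalizer needs only a K x K determinant
    return ll - len(baskets) * np.linalg.slogdet(V.T @ V + np.eye(K))[1]

def grad(V, baskets):
    """Analytic gradient of the log-likelihood, Eqs. (5)-(7)."""
    K = V.shape[1]
    G = np.zeros_like(V)
    for A in baskets:
        VA = V[A]
        G[A] += 2.0 * np.linalg.inv(VA @ VA.T) @ VA                       # first term, Eq. (6)
    G -= len(baskets) * 2.0 * V @ np.linalg.inv(V.T @ V + np.eye(K))      # second term, Eq. (7)
    return G

rng = np.random.default_rng(0)
V = rng.normal(size=(6, 3))
baskets = [[0, 1], [2, 3, 4], [1, 5]]

# Central-difference check of a single entry of the gradient
eps = 1e-6
Vp, Vm = V.copy(), V.copy()
Vp[0, 1] += eps
Vm[0, 1] -= eps
num = (log_lik(Vp, baskets) - log_lik(Vm, baskets)) / (2 * eps)
print(abs(num - grad(V, baskets)[0, 1]) < 1e-5)  # True
```

Note that with $K = 3$ trait dimensions no basket may exceed three items, consistent with the cardinality constraint discussed in Section 2.1.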
2.3.1 Stochastic Gradient Ascent
We implement stochastic gradient ascent with a form of momentum known as Nesterov's Accelerated Gradient (NAG) [26]:

$b_{t+1} = \mu b_t + \varepsilon \, \nabla f(x_t + \mu b_t) \qquad (8)$
$x_{t+1} = x_t + b_{t+1}, \qquad (9)$

where $b_t$ accumulates the gradients, $\varepsilon$ is the learning rate, $\mu$ is the momentum/NAG coefficient, and $\nabla f(x_t + \mu b_t)$ is the gradient at $x_t + \mu b_t$.
We use the following schedule for annealing the learning rate:

$\varepsilon_t = \dfrac{\varepsilon_0}{1 + t / \tau}, \qquad (10)$

where $\varepsilon_0$ is the initial learning rate, $t$ is the iteration counter, and $\tau$ is the number of iterations for which $\varepsilon$ should be kept nearly constant. This schedule keeps $\varepsilon$ nearly constant for the first $\tau$ training iterations, which allows the algorithm to find the general location of the local maximum, and then anneals $\varepsilon$ at a slow rate that is known from theory to guarantee convergence to a local maximum [28]. In practice, we set $\tau$ so that $\varepsilon$ is held nearly fixed until the iteration just before the test log-likelihood begins to decrease (which indicates that we have likely "jumped" past the local maximum), and we find settings of $\varepsilon_0$ and $\mu$ that work well for the datasets used in this paper. Instead of computing the gradient using a single training instance for each iteration, we compute the gradient using more than one training instance, called a "mini-batch". We find that a mini-batch size of 1000 instances works well for the datasets we tested.
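As a concrete sketch of the update (8)-(9) with an annealing schedule of the form (10): the quadratic objective and hyperparameter values below are our own stand-ins, not the paper's settings.

```python
import numpy as np

def nag_step(x, b, grad_fn, eps_t, mu):
    """One Nesterov accelerated gradient *ascent* step, Eqs. (8)-(9)."""
    g = grad_fn(x + mu * b)   # gradient evaluated at the look-ahead point
    b = mu * b + eps_t * g    # Eq. (8): accumulate gradients
    return x + b, b           # Eq. (9): parameter update

def learning_rate(eps0, t, tau):
    """Annealing schedule, Eq. (10): ~eps0 for t << tau, then ~1/t decay."""
    return eps0 / (1.0 + t / tau)

# Maximize f(x) = -||x||^2, a stand-in for the log-likelihood; its gradient is -2x
x, b = np.array([5.0, -3.0]), np.zeros(2)
for t in range(500):
    x, b = nag_step(x, b, lambda y: -2.0 * y, learning_rate(0.1, t, 100), mu=0.95)
print(np.linalg.norm(x) < 1e-3)  # True: converged near the maximum at the origin
```

In the full model, `grad_fn` would be the mini-batch gradient from Eqs. (5)-(7) and `x` the flattened matrix $V$.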
2.4 Regularization
We add a quadratic regularization term to the log-likelihood, based on item popularity, to discourage large parameter values and avoid overfitting. Since not all items in the item catalog are purchased with the same frequency, we encode prior assumptions into the regularizer. The motivation for using item popularity in the regularizer is that the magnitude of the $K$-dimensional item vector $v_i$ can be interpreted as the popularity of the item, as shown in [8, 17]:

$f(V) = \sum_n \log \det(L_{A_n}) - |\mathcal{A}| \log \det(L + I) - \dfrac{\alpha}{2} \sum_i \gamma_i \| v_i \|_2^2, \qquad (11)$

where $v_i$ is the row vector from $V$ for item $i$, $\alpha$ is a tunable regularization weight, and $\gamma_i$ is an element from a vector whose elements are inversely proportional to item popularity,

$\gamma_i = \dfrac{1}{C_i}, \qquad (12)$

where $C_i$ is the number of occurrences of item $i$ in the training data.
Taking the derivative of each term of the log-likelihood with this regularization term, we now have

$\dfrac{\partial f}{\partial v_{ik}} = \sum_n \dfrac{\partial \log \det(L_{A_n})}{\partial v_{ik}} - |\mathcal{A}| \, \dfrac{\partial \log \det(L + I)}{\partial v_{ik}} - \alpha \gamma_i v_{ik}. \qquad (13)$
2.5 Predictions
We seek to compute singleton next-item predictions, given a set of observed items. An example of this class of problem is "basket completion", where we seek to compute predictions for the next item that should be added to a shopping basket, given a set of items already present in the basket.
We use a $k$-DPP to compute next-item predictions. A $k$-DPP is a distribution over all subsets $Y \in 2^{\mathcal{Y}}$ with cardinality $k$, where $\mathcal{Y}$ is the ground set, or the set of all items in the item catalog. Next-item predictions are made via a conditional density. We compute the probability of the observed basket $A$, consisting of $n$ items. For each possible item to be recommended, given the basket, the basket is enlarged with the new item to $n + 1$ items. For the new item, we determine the probability of the new set of $n + 1$ items, given that $n$ items are already in the basket. This machinery is also applicable when recommending a set $B$, which may contain more than one added item, to the basket.
A $k$-DPP is obtained by conditioning a standard DPP on the event that the set $Y$, a random set drawn according to the DPP, has cardinality $k$. Formally, for the $k$-DPP $\mathcal{P}^k$ we have:

$\mathcal{P}^k(Y = A) = \dfrac{\det(L_A)}{\sum_{|A'| = k} \det(L_{A'})}, \qquad (14)$

where $|A| = k$. Unlike (1), the normalizer sums only over sets $A'$ that have cardinality $k$.
As shown in [17], we can condition a DPP on the event that all of the elements in a set $A$ are observed. We use $L^A$ to denote the kernel matrix for this conditional DPP (the same notation is used for the conditional kernel of the corresponding $k$-DPP, since the kernels are the same); we show in Section 2.5.1 how to efficiently compute this conditional kernel. For a set $B$ not intersecting with $A$, where $|A \cup B| = k$, we have:
$\mathcal{P}^k(Y = A \cup B \mid A \subseteq Y) \propto \mathcal{P}(Y = A \cup B) \qquad (15)$
$\propto \det(L_{A \cup B}) \qquad (16)$
$\propto \det(L^A_B) \qquad (17)$
$= \dfrac{\det(L^A_B)}{Z^A_k}, \qquad (18)$

where $B$ here is a singleton set containing the possible next item for which we would like to compute a predictive probability, and $L^A_B$ denotes the principal submatrix of $L^A$ indexed by the items in $B$.
Ref. [17] shows that the kernel matrix $L^A$ for a conditional DPP is

$L^A = \left( \left[ (L + I_{\bar{A}})^{-1} \right]_{\bar{A}} \right)^{-1} - I, \qquad (19)$

where $\bar{A} = \mathcal{Y} \setminus A$, $\left[ (L + I_{\bar{A}})^{-1} \right]_{\bar{A}}$ is the restriction of $(L + I_{\bar{A}})^{-1}$ to the rows and columns indexed by elements in $\bar{A}$, and $I_{\bar{A}}$ is the matrix with ones in the diagonal entries indexed by elements of $\bar{A}$ and zeroes everywhere else.
The normalization constant for Eq. 18 is

$Z^A_k = \sum_{|B'| = k - |A|, \; B' \cap A = \emptyset} \det(L^A_{B'}), \qquad (20)$

where the sum runs over all sets $B'$ of size $k - |A|$ that are disjoint from $A$. How can we compute it analytically?
We see from [17] that

$Z^A_k = e_{k - |A|}(\lambda_1, \lambda_2, \ldots, \lambda_N), \qquad (21)$

where $\lambda_1, \lambda_2, \ldots, \lambda_N$ are the eigenvalues of $L^A$ and $e_{k - |A|}$ is the $(k - |A|)$th elementary symmetric polynomial on these eigenvalues.^{3} ^{3}Recall that when $L$ is defined in a low-rank form, then all eigenvalues $\lambda_i = 0$ for $i > K$, greatly simplifying the computation. When $L$ is full rank, this is not the case. Section 3 compares the practical performance of a full-rank and low-rank $L$.
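The elementary symmetric polynomials in (21) can be evaluated in $O(Nk)$ time with the standard recurrence $e_j(\lambda_1, \ldots, \lambda_n) = e_j(\lambda_1, \ldots, \lambda_{n-1}) + \lambda_n \, e_{j-1}(\lambda_1, \ldots, \lambda_{n-1})$. A small sketch (ours, not the paper's code):

```python
def elem_sym_poly(lams, k):
    """Return [e_0, e_1, ..., e_k] of the eigenvalues lams via the standard recurrence."""
    e = [1.0] + [0.0] * k
    for lam in lams:
        for j in range(k, 0, -1):   # descending, so e[j-1] is still the previous value
            e[j] += lam * e[j - 1]
    return e

# With a low-rank kernel only K eigenvalues are nonzero, and zeros contribute nothing.
lams = [2.0, 1.0, 0.5, 0.0, 0.0]    # eigenvalues of a hypothetical rank-3 conditional kernel
e = elem_sym_poly(lams, 2)
print(e[1], e[2])  # e_1 = 2 + 1 + 0.5 = 3.5, e_2 = 2*1 + 2*0.5 + 1*0.5 = 3.5
```

Because zero eigenvalues leave the recurrence unchanged, only the $K$ nonzero eigenvalues of a low-rank kernel need to be visited.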
Therefore, to compute the conditional probability for a single item $b$ in the singleton set $B = \{b\}$, given the appearance of the items in a set $A$, we have

$\mathcal{P}^{|A|+1}(Y = A \cup B \mid A \subseteq Y) = \dfrac{\det(L^A_B)}{Z^A_{|A|+1}} \qquad (22)$
$= \dfrac{L^A_{bb}}{e_1(\lambda_1, \lambda_2, \ldots, \lambda_N)} \qquad (23)$
$= \dfrac{L^A_{bb}}{\sum_i \lambda_i}, \qquad (24)$

where $\lambda_1, \lambda_2, \ldots, \lambda_N$ are the eigenvalues of $L^A$ and $e_1$ is the first elementary symmetric polynomial on these eigenvalues.
2.5.1 Efficient DPP Conditioning
The conditional probability used for prediction (and hence set recommendation or basket completion) uses $L^A$ from Eq. 19, which requires two inversions of large matrices. These are expensive operations, particularly for a large item catalog (large $N$). In this section we describe a way to efficiently condition the DPP kernel that is enabled by our low-rank factorization of $L$.
Ref. [8] shows that for a DPP with kernel $L$, the conditional kernel $L^A$, with minors satisfying

$\mathcal{P}(Y = A \cup B \mid A \subseteq Y) \propto \det(L^A_B) \qquad (25)$

on $\bar{A}$, can be computed from $L$ by the rank-$|A|$ update

$L^A = L_{\bar{A}} - L_{\bar{A}, A} \, L_A^{-1} \, L_{A, \bar{A}}, \qquad (26)$
where $L_{\bar{A}, A}$ consists of the $\bar{A}$ rows and $A$ columns of $L$. Substituting $L = VV^T$ into Eq. 26 gives

$L^A = V_{\bar{A}} \, Z^A \, V_{\bar{A}}^T, \qquad (27)$

where

$Z^A = I_K - V_A^T \left( V_A V_A^T \right)^{-1} V_A \qquad (28)$
is a projection matrix, and is thus idempotent: $Z^A Z^A = Z^A$. Since $Z^A$ is also symmetric, we have $Z^A = Z^A (Z^A)^T$, and substituting into (27) yields

$L^A = V_{\bar{A}} \, Z^A (Z^A)^T \, V_{\bar{A}}^T \qquad (29)$
$= V^A (V^A)^T, \qquad (30)$

where

$V^A = V_{\bar{A}} \, Z^A. \qquad (31)$
Conditioning the DPP using Eq. 30 requires computing the inverse of an $|A| \times |A|$ matrix, as shown in Eq. 28, which is $O(|A|^3)$. This is much less expensive than the matrix inversions in Eq. 19 when $|A| \ll N$, which we expect for most recommendation applications. For example, in online shopping applications, the size of a shopping basket ($|A|$) is generally far smaller than the size of the item catalog ($N$).
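The conditioning steps above amount to a few lines of linear algebra. The sketch below (our own illustration with a random low-rank kernel) cross-checks the projection form of Eqs. (28)-(31) against the Schur-complement update of Eq. 26:

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 8, 3
V = rng.normal(size=(N, K))
L = V @ V.T

A = [0, 1]                                   # observed basket
Abar = [i for i in range(N) if i not in A]   # remaining candidate items

# Eq. (28): Z^A needs only an |A| x |A| inverse, and is a K x K projection matrix
VA = V[A]
Z = np.eye(K) - VA.T @ np.linalg.inv(VA @ VA.T) @ VA

# Eqs. (30)-(31): conditional kernel in low-rank form
V_cond = V[Abar] @ Z
L_cond = V_cond @ V_cond.T

# Cross-check against the rank-|A| update of Eq. (26)
L_schur = (L[np.ix_(Abar, Abar)]
           - L[np.ix_(Abar, A)] @ np.linalg.inv(L[np.ix_(A, A)]) @ L[np.ix_(A, Abar)])
print(np.allclose(Z @ Z, Z), np.allclose(L_cond, L_schur))  # True True

# Eq. (24): singleton next-item probabilities are diag(L^A) normalized by its trace
scores = np.diag(L_cond) / np.trace(L_cond)
```

The only inverse taken on the low-rank path is the $2 \times 2$ matrix $V_A V_A^T$, in contrast with the $N \times N$ inversions required by Eq. 19.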
3 Evaluation
In this section we compare the low-rank DPP model with a full-rank DPP that uses a fixed-point optimization algorithm called Picard iteration [24] for learning. We wish to showcase the advantage of low-rank DPPs in practical scenarios such as basket completion. First, we compare the test log-likelihood of low-rank and full-rank DPPs and show that the low-rank model's ability to generalize is comparable to that of the full-rank version. We also compare the training times and prediction times of both algorithms and show a clear advantage for the low-rank model presented in this paper. Our implementations of the low-rank and full-rank DPP models are written in Julia, and we perform all experiments on a Windows 10 system with 32 GB of RAM and an Intel Core i7-4770 CPU @ 3.4 GHz.
Comparing test log-likelihood values and training time is consistent with previous studies [9, 24]. Log-likelihood values, however, are not always correlated with other evaluation metrics. In the recommender systems community it is usually more accepted to use other metrics, such as precision@$k$ and mean percentile rank (MPR). In this paper we also compare DPPs (low-rank and full-rank) to other competing methods using these more "traditional" evaluation metrics. Our experiments are based on several datasets:


Amazon Baby Registries: This public dataset consists of 111,006 registries of baby products from 15 different categories (such as "feeding", "diapers", "toys", etc.), where the item catalog and registries for each category are disjoint. The public dataset was obtained by collecting baby registries from amazon.com and was used in previous DPP studies [9, 24]. In particular, [9] provides an in-depth description of this dataset. To maintain consistency with prior work, we used a random split of 70% of the data for training and 30% for testing. We use $K$ trait dimensions for the low-rank DPP models trained on this data. While the Baby Registries dataset is relatively large, previous studies analyzed each of its categories separately. We maintain this approach for the sake of consistency with prior work.
We also construct a dataset composed of the concatenation of the three most popular categories: apparel, diaper, and feeding. This three-category dataset allows us to simulate data that could be observed for department stores that offer a wide range of items in different product categories. Its construction is deliberate, and concatenates three disjoint subgraphs of basket-item purchase patterns. This dataset serves to highlight differences between DPPs and models based on matrix factorization (MF), as there are no items or baskets shared between the three subgraphs. Collaborative-filtering-based MF models, which represent each basket and item with a latent vector, will perform poorly on this dataset: MF models are invariant to global rotations of the embedded vectors, but because there are no shared observations between the three categories, they are also invariant to arbitrary rotations of the vectors in each disjoint subgraph, without any effect on the likelihood or predictive error in the other subgraphs. A global ranking based on inner products can then be arbitrarily affected by the basket and item embeddings arising from each subgraph.
The lowrank approximation presented in this paper facilitates scalingup DPPs to much larger datasets. Therefore, we conducted experiments on two additional realworld datasets, as we explain next.

MS Store: This is a proprietary dataset composed of shopping baskets purchased in Microsoft's Web-based store, microsoftstore.com. It consists of 243,147 purchased baskets composed of 2,097 different hardware and software items. We use a random split of 80% of the data for training and 20% for testing. We use $K$ trait dimensions for the low-rank DPP model trained on this data.

Belgian Retail Supermarket: This is a public dataset [3, 4] composed of shopping baskets purchased over three non-consecutive time periods from a Belgian retail supermarket store. There are 88,163 purchased baskets, composed of 16,470 unique items. We use a random split of 80% of the data for training and 20% for testing. We use $K$ trait dimensions for the low-rank DPP model trained on this data.
Since we are interested in the basket completion task, which requires baskets containing at least two items, we remove all baskets containing only one item from each dataset before splitting the data into training and test sets.
We determine convergence during training of both the low-rank and full-rank DPP models using the criterion

$\dfrac{\left| f^{(t)} - f^{(t-1)} \right|}{\left| f^{(t-1)} \right|} < \delta,$

which measures the relative change in training log-likelihoods from one iteration to the next, where $\delta$ is a small convergence tolerance.
3.1 Full Rank vs. Low Rank
Table 1: Average test log-likelihoods for the full-rank (F-Rank) and low-rank (L-Rank) DPP models.

Baby Registry
Category    F-Rank      L-Rank
Furniture   -7.07391    -7.00022
Carseats    -7.20197    -7.27515
Safety      -7.08845    -7.01632
Strollers   -7.83098    -7.83201
Media       -12.29392   -12.39054
Health      -10.09915   -10.36373
Toys        -11.06298   -11.07322
Bath        -11.89129   -11.88259
Apparel     -13.84652   -13.85295
Bedding     -11.53302   -11.58239
Diaper      -13.13087   -13.16574
Gear        -12.30525   -12.17447
Feeding     -14.91297   -14.87305
Gifts       -4.94114    -4.96162
Moms        -5.39681    -5.34985

MS Store
              F-Rank   L-Rank
All Products  -15.10   -15.23
We begin by comparing test log-likelihood values of the low-rank DPP model presented in this paper with the full-rank DPP trained using Picard iteration. Table 1 shows the average test log-likelihood values of both models across the different categories of the Baby Registries dataset, as well as the MS Store dataset. In the Baby Registries dataset the full-rank model performs better in 9 categories, compared with 6 categories for the low-rank model, and for the MS Store dataset the full-rank model performs better. The differences in the log-likelihood values are small, however, and as we show in Section 3.2 these differences do not necessarily translate into better results on other evaluation metrics.
3.1.1 Training Time
A key contribution of the Picard iteration method was an improvement in training (convergence) time of up to an order of magnitude [24] compared to previous methods. However, the Picard iteration method requires inverting an $N \times N$ full-rank matrix, where $N$ is the number of items in the catalog; this matrix inversion has $O(N^3)$ time complexity. In the low-rank model, this operation is replaced by the inversion of a $K \times K$ matrix, where $K \ll N$, and training is performed by stochastic gradient ascent. This translates into considerably faster training times, particularly when the item catalog is large.
Figure 1(a) depicts the training time in seconds of the fullrank (FRank) model vs. the lowrank (LRank) DPP model described in this paper. Table 2 shows the number of iterations required for each model to reach convergence. Training times are shown for each of the 15 categories in the Baby Registry dataset. In all but one category, the training time of the lowrank model was considerably faster. On average, the lowrank model is 8.9 times faster to train than the fullrank model.
Table 2: Number of training iterations required to reach convergence.

Category    L-Rank   F-Rank
Mom         67       1294
Gifts       126      1388
Feeding     68       123
Gear        82       136
Diaper      83       1065
Bedding     88       772
Apparel     48       129
Bath        64       1664
Toys        66       970
Health      68       1337
Media       126      958
Strollers   53       1637
Safety      59       1306
Carseats    54       1218
Furniture   54       1277
3.1.2 Prediction Time
In production settings, training is usually performed in advance (offline), while predictions are computed per request (online). A typical real-world recommender system models at least thousands of items, and often many more. The "relationships" between items change slowly with time, so it is reasonable to train a model once a day or even once a week. The number of possible baskets, however, is vast and grows with the number of items in the catalog. Therefore, it is wasteful, and sometimes impossible, to precompute all possible basket recommendations in advance. The preferred choice in most cases is to compute predictions online, in real time.
High prediction times may overload online servers, leading to high response times and even failure to provide recommendations. The ability to compute recommendations efficiently is key to any realworld recommender system. Hence, in realworld scenarios prediction times are usually much more important than training times.
Previous DPP studies [9, 24] focused on training times and did not offer any improvement in prediction times. In fact, as we show next, the average prediction time spikes for the full-rank DPP when the size of the item catalog reaches several thousand items, and quickly becomes impractical in real-world settings where the inventory of items is large and fast online predictions are required. Our low-rank model facilitates far faster prediction times and scales well to large item catalogs, which is key to any practical use of DPPs. We believe this contribution opens the door to large-scale use of DPP models in commercial settings.
In Figure 1(b) we compare the average prediction time for a test-set basket for each of the 15 categories in the Baby Registries dataset. This figure shows the average time to compute predictive probabilities for all possible items that could be added to the basket for a given test basket instance, where the set of possible items are those items found in the item catalog but not in the test basket. Since the catalog is composed of a maximum of only 100 items for each Baby Registries category, due to the way that the dataset was constructed, these prediction times are quite small. Again we notice a clear advantage for the low-rank model across all categories: the average prediction time for the full-rank model is 2.55 ms per basket, compared with 0.39 ms for the low-rank model (6.8 times faster). Since the number of items in the catalog for each Baby Registries category is small (at most 100 items), we also measured the prediction time for the MS Store dataset, which contains 2,097 items. Due to the much larger item catalog, the average time for a single basket prediction with the full-rank model increases significantly, to 1.66 seconds, which is probably too slow for many real-world recommender systems. In contrast, the average prediction time of the low-rank model depends mostly on the number of trait dimensions in the model, and is only 83.6 ms per basket on average. These numbers indicate a speed-up factor of 19.9.
Our low-rank DPP model also provides substantial savings in memory consumption compared to the full-rank DPP. For example, for the MS Store dataset, composed of a catalog of 2,097 items, storing the full-rank $N \times N$ DPP kernel matrix (assuming 64-bit floating-point numbers) requires far more memory than storing the $N \times K$ low-rank matrix $V$: in this example, the low-rank model requires approximately 140 times less memory to store the model parameters, and this savings increases with larger item catalogs.
3.2 Basket Completion and Recommendations
Previous papers have evaluated DPP recommendations by comparing test log-likelihood values. In this section we also consider more "traditional" evaluation metrics commonly used in the recommender systems community.
We formulate the basket-completion task as follows. Let $A$ be a subset of co-purchased items (i.e., a basket) from the test set. In order to evaluate the basket-completion task, we pick an item $a \in A$ at random and remove it from $A$. We denote the remaining set as $A^{-}$; formally, $A^{-} = A \setminus \{a\}$. Given the ground set of possible items $\mathcal{Y}$, we define the candidates set $\mathcal{C}$ as the set of all items except those already in $A^{-}$; i.e., $\mathcal{C} = \mathcal{Y} \setminus A^{-}$. Our goal is to identify the missing item $a$ among all other items in $\mathcal{C}$.
We compare the lowrank DPP model with the fullrank DPP model. We also consider several other competing models for the basket completion task:


Poisson Factorization (PF): Poisson factorization [10] is a recent variant of probabilistic matrix factorization that has been shown to work well with implicit recommendation data, such as clicks or purchases. PF models user-item interactions with factorized Poisson distributions, and learns sparse, non-negative trait vectors for latent user preferences and item attributes in a low-dimensional space. Gamma priors are placed on the trait vectors; we set the gamma shape and rate hyperparameters to 0.3, following [6, 10]. The PF model is not sensitive to these settings, as indicated in [6, 10]. We use a publicly available implementation of PF [5]. (Note that [5] is actually an implementation of PF with a social component; we disable the social component for our tests, resulting in a model equivalent to PF, since our data does not involve a social graph.)
Reco Matrix Factorization (RecoMF): RecoMF is a matrix factorization model [27] that is used as the recommendation system for Xbox Live. Sigmoid functions are used to model the odds of a user liking or disliking an item, and RecoMF learns latent trait vectors for users and items, along with user and item biases. Unlike PF, RecoMF requires the generation of synthetic negative training instances, and uses a scheme for sampling negatives based on popularity. RecoMF places Gaussian priors on the trait vectors, and gamma hyperpriors on each. We use the hyperparameter settings described in [27], which have been found to provide good performance for implicit recommendation data.
Associative Classifier (AC): We use an associative classifier as a competing method, since association rules are often used for market basket analysis [2, 13]. Our associative classifier is the publicly available implementation [7] of the Classification Based on Associations (CBA) algorithm [22]. We use minimum support and minimum confidence thresholds of 1.0% and 20.0%, respectively. Since associative classifiers do not provide probability estimates for all possible sets, the model cannot compute rankings for all of the candidate items in $\mathcal{C}$, and we therefore cannot reasonably compute MPR for it.
The matrix-factorization models are parameterized in terms of users and items. Since we have no explicit users in our data, we construct "virtual" users from the contents of each basket for the purposes of our evaluation, where a new user $u$ is constructed for each basket $A$; the set of items that $u$ has purchased is simply the contents of $A$. Additionally, we use $K$ trait dimensions for the matrix-factorization models.
In the following evaluation we consider three measures:

[leftmargin=*]

Mean Percentile Rank (MPR): Computing the percentile rank of an item requires the ability to rank the item against all other items in $\mathcal{C}$. Therefore, the MPR evaluation results do not include the AC model, which ranks only those items for which an association rule was found. For DPPs and the other competing methods, we rank the items according to their probabilities of completing the missing set $A^{-}$. Namely, given an item $b$ from the candidates set $\mathcal{C}$, we denote by $p_b$ the probability $\mathcal{P}(Y = A^{-} \cup \{b\} \mid A^{-} \subseteq Y)$. The percentile rank (PR) of the missing item $a$ is defined by

$\mathrm{PR}_a = \dfrac{\sum_{b \in \mathcal{C}} \mathbb{1}(p_a \geq p_b)}{|\mathcal{C}|} \times 100\%,$

where $\mathbb{1}(\cdot)$ is an indicator function and $|\mathcal{C}|$ is the number of items in the candidates set. The Mean Percentile Rank (MPR) is the average PR over all instances in the test set:

$\mathrm{MPR} = \dfrac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathrm{PR}_t,$

where $\mathcal{T}$ is the set of test instances. MPR is a recall-oriented metric commonly used in studies that involve implicit recommendation data [11, 20]. $\mathrm{MPR} = 100$ always places the held-out item for the test instance at the head of the ranked list of predictions, while $\mathrm{MPR} = 50$ is equivalent to random selection.

Precision@$k$: We define precision@$k$ as

$\text{precision@}k = \dfrac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathbb{1}(\mathrm{rank}_t \leq k),$

where $\mathrm{rank}_t$ is the predicted rank of the held-out item for test instance $t$. In other words, precision@$k$ is the fraction of instances in the test set for which the predicted rank of the held-out item falls within the top $k$ predictions.

Popularity-weighted precision@$k$: Datasets used to evaluate recommendation systems typically contain a popularity bias [30], where users are more likely to provide feedback on popular items. Due to this popularity bias, conventional metrics such as MPR and precision@$k$ are typically biased toward popular items. Using ideas from [30], we propose popularity-weighted precision@$k$:

$\text{popularity-weighted precision@}k = \sum_{t \in \mathcal{T}} w_t \, \mathbb{1}(\mathrm{rank}_t \leq k),$

where $w_t$ is the weight assigned to the held-out item for test instance $t$, defined as

$w_t \propto \dfrac{1}{C_t^{\beta}},$

where $C_t$ is the number of occurrences of the held-out item for test instance $t$ in the training data, and $\beta \geq 0$. The weights are normalized, so that $\sum_{t \in \mathcal{T}} w_t = 1$. This popularity-weighted precision@$k$ measure assumes that item popularity follows a power law. By assigning more weight to less popular items when $\beta > 0$, this measure serves to bias precision@$k$ towards less popular items. For $\beta = 0$, we obtain the conventional precision@$k$ measure. We set $\beta = 0.5$ in our evaluation.
Figures 2, 3, and 4 show the performance of each method on each dataset for our evaluation measures. Note that we could not feasibly train the full-rank DPP or AC models on the Belgian dataset, since these models do not scale to datasets with large item catalogs. The performance of the low-rank and full-rank DPP models is generally comparable across all datasets and metrics, with the low-rank DPP providing better performance in some cases. We attribute this advantage to the use of regularization (an informative prior, from a Bayesian perspective) in our low-rank model. We see that the RecoMF model outperforms all other models on all metrics for the Amazon Diaper dataset. For all other datasets, the low-rank DPP model outperforms on MPR by a sizeable margin, and is the only model to consistently provide high MPR across all datasets. For the precision@$k$ metrics, the low-rank DPP often leads, or provides good performance that is close to the leader.
We see interesting results for the Amazon apparel + diaper + feeding dataset. Surprisingly, the PF and RecoMF models provide an MPR of approximately 50%, which is equivalent to basket completion by random selection. Recall that the categories in the Amazon Baby Registries dataset are disjoint. Due to the formulation of the likelihood function for models based on matrix factorization, these models learn an embedding of item trait vectors that is mixed together across the disjoint categories. This behavior results in the model mixing predictions across categories, e.g. recommending an item from one category for a basket composed of items from another category, which is never observed in the data, thus leading to degenerate results. We empirically observe that the DPP models do not have this issue, and are able to effectively learn an embedding of items in this scenario: the DPP models provide an MPR of approximately 70% for both the Amazon three-category and single-category (diaper) datasets.
3.2.1 Limitations
We include the popularity-weighted precision@$k$ results in Figure 4 to highlight a limitation of the DPP models. For this metric RecoMF generally provides the best performance, with the DPP models in second place. As discussed in [27], this behavior may result from RecoMF's scheme for sampling negatives by popularity, which tends to improve recommendations for less popular items. We conjecture that a different regularization scheme for our low-rank DPP model, or a Bayesian version of this model that provides more robust regularization, may improve our performance on this metric. It is also important to note the limitations of this metric, including the assumption that item popularity follows a power law, and the power-law exponent setting of 0.5 used when computing the metric for each dataset. Due to these limitations, the popularity-weighted precision@$k$ results we present here may not fully reflect the empirical popularity bias actually present in the data.
4 Related Work
Several learning algorithms for estimating the full-rank DPP kernel matrix from observed data have been proposed. Ref. [9]
presented one of the first methods for learning a nonparametric form of the DPP kernel matrix, which involves an expectation-maximization (EM) algorithm. This work also considers using projected gradient ascent on the DPP log-likelihood function, but finds that this is not a viable approach, since it usually results in degenerate estimates due to the projection step.
In [24], a fixed-point optimization algorithm for DPP learning, called Picard iteration, is described. Picard iteration has the advantage of being simple to implement and performing much faster than EM during training. We show in this paper that our low-rank learning approach is far faster than Picard iteration, and therefore than EM, during training, and that our low-rank representation of the DPP kernel allows us to compute predictions much faster than any method that uses the full-rank kernel.
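To illustrate why a low-rank kernel speeds up prediction, consider basket completion with L = V V^T, where V is an n x d item-trait matrix. Scoring a candidate item i for a basket S only touches the |S|+1 relevant rows of V, never the full n x n kernel. The following numpy sketch uses names of our own choosing and unnormalized scores det(L_{S ∪ {i}}); it assumes |S|+1 <= d, since otherwise the determinant is identically zero:

```python
import numpy as np

def next_item_scores(V, basket, candidates):
    """Unnormalized basket-completion scores det(L_{S+i}) with L = V @ V.T.
    V: (n, d) low-rank item-trait matrix; basket: list of item indices."""
    scores = {}
    for i in candidates:
        idx = basket + [i]
        sub = V[idx] @ V[idx].T  # (|S|+1) x (|S|+1) principal submatrix of L
        scores[i] = float(np.linalg.det(sub))
    return scores
```

Each candidate costs O((|S|+1)^2 d) to form the submatrix plus a small determinant, versus materializing and indexing an n x n kernel in the full-rank case.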
Ref. [1]
presented Bayesian methods for learning a DPP kernel, with particular parametric forms for the similarity and quality components of the kernel. Markov chain Monte Carlo (MCMC) methods are used for sampling from the posterior distribution over kernel parameters. In contrast to this work, and similar to
[9, 24], our approach uses a nonparametric form of the kernel and therefore does not assume any particular parametrization.
A method for partially learning the DPP kernel is studied in [16]. The similarity component of the DPP kernel is fixed, and a parametric form of the function for the quality component of the kernel is learned. This is a convex optimization problem, unlike the task of learning the full kernel, which is a more challenging nonconvex optimization problem.
We focus on the prediction task of “basket completion” in this work, as it is at the heart of the online retail experience. For the purposes of evaluating our model, we compute predictions for the next item that should be added to a shopping basket, given a set of items already present in the basket. A number of approaches to this problem have been proposed. Ref. [25]
describes a user-neighborhood-based collaborative filtering method, which uses rating data in the form of binary purchases to compute the similarity between users, and then generates a purchase prediction for a user and item by computing a weighted average of the binary ratings for that item. A technique that uses logistic regression to predict if a user will purchase an item based on binary purchase scores obtained from market basket data is described in
[19]. Additionally, other collaborative filtering approaches could be applied to the basket completion problem, such as [27], which is a one-class matrix factorization model.
5 Conclusions
In this paper we have presented a new method for learning the DPP kernel from observed data, which exploits the unique properties of a low-rank factorization of this kernel. Previous approaches have focused on learning a full-rank kernel, which does not scale to large item catalogs due to high memory consumption and expensive operations required during training and when computing predictions. We have shown that our low-rank DPP model is substantially faster and more memory efficient than previous approaches for both training and prediction. Furthermore, through an experimental evaluation using several real-world datasets in the domain of recommendations for shopping baskets, we have shown that our method provides equivalent or sometimes better predictive performance than prior full-rank DPP approaches, while in many cases also providing better predictive performance than competing methods.
6 Acknowledgements
We thank Gal Lavee and Shay Ben Elazar for many helpful discussions. We thank Nir Nice for supporting this work.
References
 [1] R. H. Affandi, E. Fox, R. Adams, and B. Taskar. Learning the parameters of determinantal point process kernels. In ICML, pages 1224–1232, 2014.
 [2] R. Agrawal, T. Imieliński, and A. Swami. Mining association rules between sets of items in large databases. In Proc. of SIGMOD 1993, pages 207–216, 1993.
 [3] T. Brijs. Retail market basket data set. In Workshop on Frequent Itemset Mining Implementations (FIMI’03), 2003.
 [4] T. Brijs, G. Swinnen, K. Vanhoof, and G. Wets. Using association rules for product assortment decisions: A case study. In KDD, pages 254–260, 1999.
 [5] A. J. Chaney. Social Poisson factorization (SPF). https://github.com/ajbc/spf, 2015.
 [6] A. J. Chaney, D. M. Blei, and T. Eliassi-Rad. A probabilistic model for using social networks in personalized item recommendation. In RecSys, pages 43–50, 2015.
 [7] F. Coenen. The LUCS-KDD implementation of CBA (classification based on associations). http://www.csc.liv.ac.uk/~frans/KDD/Software/CMAR/cba.html, 2004. Department of Computer Science, The University of Liverpool, UK.
 [8] J. Gillenwater. Approximate inference for determinantal point processes. PhD thesis, University of Pennsylvania, 2014.
 [9] J. A. Gillenwater, A. Kulesza, E. Fox, and B. Taskar. Expectation-maximization for learning determinantal point processes. In NIPS, pages 3149–3157, 2014.
 [10] P. Gopalan, J. M. Hofman, and D. M. Blei. Scalable recommendation with hierarchical Poisson factorization. In UAI, 2015.
 [11] Y. Hu, Y. Koren, and C. Volinsky. Collaborative filtering for implicit feedback datasets. In ICDM, pages 263–272, 2008.
 [12] B. Kang. Fast determinantal point process sampling with application to clustering. In NIPS, pages 2319–2327, 2013.
 [13] S. Kotsiantis and D. Kanellopoulos. Association rules mining: A recent overview. GESTS International Transactions on Computer Science and Engineering, 32(1):71–82, 2006.
 [14] A. Kulesza and B. Taskar. Structured determinantal point processes. In NIPS, pages 1171–1179, 2010.
 [15] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, pages 1193–1200, 2011.
 [16] A. Kulesza and B. Taskar. Learning determinantal point processes. In UAI, 2011.
 [17] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012.
 [18] J. T. Kwok and R. P. Adams. Priors for diversity in generative latent variable models. In NIPS, pages 2996–3004, 2012.
 [19] J.-S. Lee, C.-H. Jun, J. Lee, and S. Kim. Classification-based collaborative filtering using market basket data. Expert Systems with Applications, 29(3):700–704, 2005.
 [20] Y. Li, J. Hu, C. Zhai, and Y. Chen. Improving one-class collaborative filtering by incorporating rich user information. In CIKM, pages 959–968, 2010.
 [21] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In UAI, 2012.
 [22] B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In KDD, 1998.
 [23] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, pages 83–122, 1975.
 [24] Z. Mariet and S. Sra. Fixed-point algorithms for learning determinantal point processes. In ICML, pages 2389–2397, 2015.
 [25] A. Mild and T. Reutterer. An improved collaborative filtering approach for predicting cross-category purchases based on binary market basket data. Journal of Retailing and Consumer Services, 10(3):123–133, 2003.
 [26] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27:372–376, 1983.
 [27] U. Paquet and N. Koenigstein. Oneclass collaborative filtering with random graphs. In WWW, pages 999–1008, 2013.
 [28] H. Robbins and S. Monro. A stochastic approximation method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
 [29] J. Snoek, R. Zemel, and R. P. Adams. A determinantal point process latent variable model for inhibition in neural spiking data. In NIPS, pages 1932–1940, 2013.
 [30] H. Steck. Item popularity and recommendation accuracy. In RecSys, pages 125–132, 2011.