Deep Determinantal Point Processes

11/17/2018
by Mike Gartrell, et al.
Criteo

Determinantal point processes (DPPs) have attracted significant attention as an elegant model that is able to capture the balance between quality and diversity within sets. DPPs are parameterized by a positive semi-definite kernel matrix. While DPPs have substantial expressive power, they are fundamentally limited by the parameterization of the kernel matrix and their inability to capture nonlinear interactions between items within sets. We present the deep DPP model as a way to address these limitations, by using a deep feed-forward neural network to learn the kernel matrix. In addition to allowing us to capture nonlinear item interactions, the deep DPP also allows easy incorporation of item metadata into DPP learning. We show experimentally that the deep DPP can provide a considerable improvement in the predictive performance of DPPs.


1 Introduction and Related Work

Modeling the relationship between items within observed subsets, drawn from a large collection, is an important challenge that is fundamental to many machine learning applications, including recommender systems [gillenwater-thesis], document summarization [kulesza2011learning, lin12], and information retrieval [kulesza11]. For these applications, we are primarily concerned with selecting a good subset of diverse, high-quality items. Balancing quality and diversity in this setting is challenging, since the number of possible subsets that could be drawn from a collection grows exponentially as the collection size increases.

Determinantal point processes (DPPs) offer an elegant and attractive model for such tasks, since they provide a tractable model that jointly considers set diversity and item quality. A DPP models a distribution over subsets of a ground set $\mathcal{Y}$ that is parametrized by a positive semi-definite matrix $\mathbf{L}$, such that for any $Y \subseteq \mathcal{Y}$,

$$\mathcal{P}(Y) \propto \det(\mathbf{L}_Y) \qquad (1)$$

where $\mathbf{L}_Y$ is the submatrix of $\mathbf{L}$ indexed by $Y$. Informally, $\det(\mathbf{L}_Y)$ represents the volume associated with subset $Y$, the diagonal entry $L_{ii}$ represents the importance of item $i$, while the entry $L_{ij}$ encodes the similarity between items $i$ and $j$. DPPs have been studied in the context of a number of applications [affandi14, chao15, krause08, mariet16, zhang17] in addition to those mentioned above. There has also been significant work regarding the theoretical properties of DPPs [kuleszaBook, borodin2009, affandi14, kuleszaThesis, gillenwater-thesis, decreuse, lavancier15].
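To make Eq. (1) concrete, the following minimal numpy sketch (not from the paper; the three toy items, their features, and their quality scores are invented) builds a small kernel from quality scores and feature similarity, and shows that a diverse pair of items receives a larger determinant than a redundant pair.

```python
import numpy as np

# Toy illustration of Eq. (1): build a small DPP kernel L from per-item
# quality scores q_i and feature-based similarity, then compare det(L_Y)
# for a redundant subset vs. a diverse one. (Illustrative values only.)
features = np.array([[1.0, 0.0],    # item 0
                     [0.95, 0.05],  # item 1: nearly identical to item 0
                     [0.0, 1.0]])   # item 2: very different from item 0
features /= np.linalg.norm(features, axis=1, keepdims=True)
quality = np.array([1.0, 1.0, 0.8])

# L_ij = q_i * (phi_i . phi_j) * q_j  -- a standard quality/diversity decomposition.
S = features @ features.T
L = np.outer(quality, quality) * S

def unnormalized_prob(Y):
    """det(L_Y): the unnormalized probability of subset Y under Eq. (1)."""
    return np.linalg.det(L[np.ix_(Y, Y)])

print(unnormalized_prob([0, 1]))  # small: items 0 and 1 are redundant
print(unnormalized_prob([0, 2]))  # larger: items 0 and 2 are diverse
```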

Learning a DPP from observed data in the form of example subsets is a challenging task that is conjectured to be NP-hard [kuleszaBook]. Some work has involved learning a nonparametric full-rank matrix $\mathbf{L}$ [gillenwater-thesis, mariet15] that does not constrain $\mathbf{L}$ to take a particular parametric form, while other work has involved learning a low-rank factorization of this nonparametric matrix [gartrell17, osogami18]. A low-rank factorization of $\mathbf{L}$ enables substantial improvements in runtime performance compared to a full-rank DPP model during training and when computing predictions, on the order of 10-20x or more, with predictive performance that is equivalent to or better than that of a full-rank model.

While the low-rank DPP model scales well, it has a fundamental limitation regarding model capacity and expressive power due to the nature of the low-rank factorization of $\mathbf{L}$. A rank-$K$ factorization of $\mathbf{L}$ places an implicit constraint on the space of possible subsets, since it assigns zero probability mass to subsets with more than $K$ items. When trained on a dataset containing subsets with at most $K$ items, we observe from the results in [gartrell17] that this constraint is reasonable and that the rank-$K$ DPP provides predictive performance that is approximately equivalent to that of the full-rank DPP. Therefore, in this scenario the rank-$K$ DPP can be seen as a good approximation of the full-rank DPP. However, we empirically observe that the rank-$K$ DPP generally does not provide improved predictive performance for values of $K$ greater than the size of the largest subset in the data. Thus, for a dataset containing subsets no larger than size $K$, from the standpoint of predictive performance there is generally no utility in increasing the number of low-rank DPP embedding dimensions beyond $K$, which establishes an upper bound on the capacity of the model. Furthermore, since the determinant is a multilinear function of the columns or rows of a matrix, a DPP is unable to capture nonlinear interactions between items within observed subsets.
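The capacity constraint can be verified directly: for a rank-$K$ kernel $\mathbf{L} = \mathbf{V}\mathbf{V}^{\top}$, any submatrix of size greater than $K$ is singular. A short numpy sketch (toy sizes, not from the paper) illustrating this:

```python
import numpy as np

# For a rank-K kernel L = V V^T, any submatrix L_Y with |Y| > K is singular,
# so det(L_Y) = 0 and the DPP places zero probability mass on such subsets.
rng = np.random.default_rng(0)
M, K = 50, 5                      # 50 items, rank-5 kernel (toy sizes)
V = rng.normal(size=(M, K))
L = V @ V.T

Y_small = list(range(K))          # subset of size K
Y_large = list(range(K + 2))      # subset of size K + 2 > K
print(np.linalg.det(L[np.ix_(Y_small, Y_small)]))  # generically nonzero
print(np.linalg.det(L[np.ix_(Y_large, Y_large)]))  # ~0 up to floating-point error
```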

The constraints of the standard DPP model motivate us to seek modeling options that enable us to increase the expressive power of the model and improve predictive performance, while still allowing us to leverage the efficient learning, sampling, and prediction algorithms available for DPPs. We present the deep DPP as a model that fulfills these requirements. The deep DPP uses a deep feed-forward neural network to learn the low-rank DPP embedding matrix, allowing us to move beyond the constraints of the standard multilinear DPP model by supporting nonlinearities in the embedding space through the use of multiple hidden layers in the deep network. The deep DPP also allows us to incorporate item-level metadata into the model, such as item names, descriptions, etc. Since the learning target of the deep DPP model is the low-rank DPP embedding matrix, we can use existing algorithms for efficient learning, sampling, and prediction for DPPs. Thus, the deep DPP provides us with an elegant deep generative model for sets.

There has been some prior work regarding the use of a deep neural network to learn DPP model parameters. In [xie2017deep], the authors describe one approach that involves a deep network, where the model is parameterized in terms of one data instance (an image) for each observed subset of labels. In contrast, our deep DPP model is more general and allows for each item within the ground set to have its own item-level metadata features (item price, description, image, etc.).

The main contributions of this work are the following:

  • We extend the standard low-rank DPP model by using a deep feed-forward neural network for learning the DPP kernel matrix, which is composed of item embeddings. This approach allows us to arbitrarily increase the expressive power of the deep DPP model by simply adding hidden layers to the deep network. Through experiments on several real-world datasets, we show empirically that compared to the standard low-rank DPP, the deep DPP can provide better predictive quality.

  • The deep DPP supports arbitrary item-level metadata. The deep network in our model allows us to easily incorporate such metadata, and automatically learns parameters that explain how this metadata interacts with the latent item embeddings in our model. In recommendation settings, leveraging metadata has been shown to improve predictive quality [kula2015metadata, vasile2016meta], particularly for cold-start scenarios with sparse user/item interactions.

2 Model

Figure 1: Deep DPP architecture. Observed subsets composed of item ids, as well as optional item-level metadata, are provided as inputs to the deep neural network during learning. The output is the learned DPP parameter matrix $\mathbf{V}$, where each row of this matrix is an item embedding vector.

We begin this section with some background on DPPs and low-rank DPPs, followed by a discussion of the architecture of our deep DPP model.

Since the normalization constant for Eq. 1 follows from the observation that $\sum_{Y \subseteq \mathcal{Y}} \det(\mathbf{L}_Y) = \det(\mathbf{L} + \mathbf{I})$, we have

$$\mathcal{P}(Y) = \frac{\det(\mathbf{L}_Y)}{\det(\mathbf{L} + \mathbf{I})} \qquad (2)$$

where a discrete DPP is a probability measure on $2^{\mathcal{Y}}$ (the power set, or set of all subsets, of $\mathcal{Y}$). Therefore, the probability of any $Y \subseteq \mathcal{Y}$ is given by Eq. 2.
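As a sanity check on the normalization in Eq. (2), the identity $\sum_{Y \subseteq \mathcal{Y}} \det(\mathbf{L}_Y) = \det(\mathbf{L} + \mathbf{I})$ can be verified by brute-force enumeration on a tiny ground set; the sketch below is illustrative only, with arbitrary toy sizes.

```python
import numpy as np
from itertools import combinations

# Brute-force check of the identity behind Eq. (2) on a tiny ground set:
# sum over all subsets Y of det(L_Y) equals det(L + I), where the empty
# set contributes det(L_emptyset) = 1.
rng = np.random.default_rng(1)
M = 4
V = rng.normal(size=(M, 3))
L = V @ V.T

total = 1.0  # contribution of the empty set
for k in range(1, M + 1):
    for Y in combinations(range(M), k):
        total += np.linalg.det(L[np.ix_(Y, Y)])

print(total)
print(np.linalg.det(L + np.eye(M)))  # should match up to numerical error
```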

We use a low-rank factorization of the matrix $\mathbf{L}$,

$$\mathbf{L} = \mathbf{V}\mathbf{V}^{\top} \qquad (3)$$

where $\mathbf{V}$ is an $M \times K$ matrix, $M = |\mathcal{Y}|$ is the size of the ground set, and $K$ is the rank of the kernel. $K$ is fixed a priori, and is often set to the size of the largest observed subset in the data.

Given a collection of $N$ observed subsets $A = \{Y_1, \ldots, Y_N\}$ composed of items from $\mathcal{Y}$, our learning task is to fit a DPP kernel based on this data. Our training data is these observed subsets $A$, and our task is to maximize the likelihood for samples drawn from the same distribution as $A$. The log-likelihood for seeing $A$ is

$$f(\mathbf{V}) = \log \mathcal{P}(A) = \sum_{n=1}^{N} \log \det\!\left(\mathbf{V}_{Y_n}\mathbf{V}_{Y_n}^{\top}\right) - N \log \det\!\left(\mathbf{V}\mathbf{V}^{\top} + \mathbf{I}\right) \qquad (4)$$

where $n$ indexes the observed subsets in $A$ and $\mathbf{V}_{Y_n}$ denotes the rows of $\mathbf{V}$ indexed by $Y_n$.

As described in [gartrell17], we augment $f(\mathbf{V})$ with a regularization term:

$$\phi(\mathbf{V}) = f(\mathbf{V}) - \alpha \sum_{i=1}^{M} \frac{1}{\lambda_i} \left\lVert \mathbf{v}_i \right\rVert_2^2 \qquad (5)$$

where $\lambda_i$ counts the number of occurrences of item $i$ in the training set, $\mathbf{v}_i$ is the corresponding row vector of $\mathbf{V}$, and $\alpha$ is a tunable hyperparameter. This regularization term reduces the magnitude of $\lVert \mathbf{v}_i \rVert$, which can be interpreted as the popularity of item $i$, according to its empirical popularity $\lambda_i$.
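A compact PyTorch sketch of the regularized objective in Eqs. (4)-(5) is given below. It is written for clarity rather than efficiency; the small diagonal jitter is our own implementation choice, and the function and argument names (`dpp_log_likelihood`, `subsets`, `item_counts`) are illustrative rather than taken from the paper's code.

```python
import torch

def dpp_log_likelihood(V, subsets, item_counts, alpha=1.0):
    """V: (M, K) embedding matrix; subsets: list of LongTensors of item ids;
    item_counts: (M,) float tensor of training occurrence counts lambda_i."""
    M, _ = V.shape
    eps = 1e-5  # small diagonal jitter for numerical stability (our choice)

    log_lik = V.new_zeros(())
    for Y in subsets:
        V_Y = V[Y]                                   # (|Y|, K)
        L_Y = V_Y @ V_Y.T + eps * torch.eye(len(Y))  # (|Y|, |Y|)
        log_lik = log_lik + torch.logdet(L_Y)        # sum_n log det(V_{Y_n} V_{Y_n}^T)

    # Normalizer of Eq. (4): N * log det(V V^T + I)
    log_lik = log_lik - len(subsets) * torch.logdet(V @ V.T + torch.eye(M))

    # Popularity-weighted L2 regularization, Eq. (5)
    reg = alpha * ((V ** 2).sum(dim=1) / item_counts).sum()
    return log_lik - reg
```

By Sylvester's determinant identity, $\det(\mathbf{V}\mathbf{V}^{\top} + \mathbf{I}_M) = \det(\mathbf{V}^{\top}\mathbf{V} + \mathbf{I}_K)$, so in practice the normalizer only requires a $K \times K$ determinant rather than the $M \times M$ one computed above.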

Figure 1 shows the architecture of the deep DPP model. As shown in this figure, a deep network is used to learn $\mathbf{V}$. Furthermore, this architecture allows us to seamlessly incorporate item metadata, such as price, item name, and description, into the model. We use self-normalizing SELU activation functions [klambauer2017self] for our deep network, since we empirically found that this activation function provides stable convergence behavior during training. We use the Adam stochastic optimization algorithm [kingma2015adam] to train our model, in conjunction with Hogwild [recht2011hogwild] for asynchronous parallel updates during training. All code is implemented in PyTorch 0.4.1 (https://pytorch.org), and will be made publicly available at a later date.
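A minimal PyTorch sketch of this architecture follows, under assumptions about details not specified here: the input embedding dimension, hidden-layer width, and metadata encoding are illustrative choices, and `DeepDPP` and `dpp_log_likelihood` (the earlier sketch) are hypothetical names rather than the authors' code.

```python
import torch
import torch.nn as nn

class DeepDPP(nn.Module):
    """Sketch of Figure 1: item-id embeddings concatenated with metadata
    features are passed through a SELU feed-forward network whose output
    rows form the DPP embedding matrix V."""
    def __init__(self, num_items, meta_dim, input_dim=64, hidden_dim=128,
                 out_dim=100, num_hidden_layers=2):
        super().__init__()
        self.item_embed = nn.Embedding(num_items, input_dim)
        layers, d = [], input_dim + meta_dim
        for _ in range(num_hidden_layers):
            layers += [nn.Linear(d, hidden_dim), nn.SELU()]
            d = hidden_dim
        layers.append(nn.Linear(d, out_dim))  # final row of V for each item
        self.mlp = nn.Sequential(*layers)

    def forward(self, item_ids, item_meta):
        """item_ids: (M,) LongTensor; item_meta: (M, meta_dim) float features.
        Returns the (M, out_dim) DPP embedding matrix V."""
        x = torch.cat([self.item_embed(item_ids), item_meta], dim=-1)
        return self.mlp(x)

# Training sketch: maximize the regularized log-likelihood with Adam.
# model = DeepDPP(num_items=4070, meta_dim=16)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# V = model(all_item_ids, all_item_meta)
# loss = -dpp_log_likelihood(V, batch_of_subsets, item_counts)
# loss.backward(); opt.step()
```

With `num_hidden_layers=0` the network reduces to a single linear map, which corresponds to the 0-hidden-layer baseline rows in the tables below.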

3 Experiments

We perform next-item prediction experiments on a dataset composed of purchased shopping baskets: the UK retail dataset [lsbupr1492]. We omit all baskets with more than 100 items, which allows us to use a low-rank factorization of the DPP (with rank $K$ set to the size of the largest retained basket) that scales well in training and prediction time, while also keeping memory consumption for model parameters at a manageable level. The UK retail dataset contains 25,898 baskets drawn from a catalog of 4,070 items. This dataset provides price and description metadata for each item, which we used in our experiments. For all experiments, a random selection of 80% of the baskets is used for training, and the remaining 20% is used for testing.

Since the focus of our work is on improving DPP predictive performance, we use the standard low-rank DPP as the baseline model for our experiments.

The performance of all methods is compared using the Mean Percentile Rank (MPR) metric. MPR is a recall-based metric that we use to evaluate the model's predictive power by measuring how well it predicts the next item in a basket; it is a standard choice for recommender systems [hu08, li10]. An MPR of 50 is equivalent to random selection, while an MPR of 100 indicates that the model perfectly predicts the held-out item. See Appendix A for a formal definition of MPR.

UK without metadata
Number of hidden layers | All Baskets  | Small Baskets | Medium Baskets | Large Baskets
0                       | 82.62 ± 2.39 | 82.62 ± 2.39  | 81.25 ± 1.83   | 79.28 ± 2.50
1                       | 82.66 ± 2.36 | 82.66 ± 2.36  | 81.32 ± 1.68   | 81.62 ± 1.92
2                       | 82.74 ± 2.40 | 82.74 ± 2.40  | 80.91 ± 2.00   | 76.38 ± 3.36
3                       | 82.74 ± 2.35 | 82.74 ± 2.35  | 80.92 ± 1.95   | 76.29 ± 3.26

UK with metadata
Number of hidden layers | All Baskets  | Small Baskets | Medium Baskets | Large Baskets
1                       | 82.54 ± 2.38 | 82.54 ± 2.38  | 81.28 ± 1.72   | 82.16 ± 1.82
2                       | 84.93 ± 2.39 | 84.93 ± 2.39  | 85.20 ± 2.02   | 85.88 ± 2.90
3                       | 82.46 ± 2.45 | 82.46 ± 2.45  | 81.03 ± 1.93   | 77.00 ± 3.22
Table 1: MPR results for the UK retail dataset, with and without metadata. Results show mean and standard deviation estimates obtained using bootstrapping. Bold values indicate improvement over the low-rank DPP (with 0 hidden layers) outside the standard deviation.

Table 1 shows the results of our experiments. In addition to computing MPR on all test baskets, we also computed results on the test set divided into three equally-sized populations segmented according to basket size. We see that the deep DPP with two hidden layers, and metadata enabled, is able to show small to moderate improvements in MPR over the standard low-rank DPP (with 0 hidden layers) for medium and large baskets. The low-rank DPP is unable to natively support metadata, since this would require manual feature engineering that we have not implemented, and hence results for this model with metadata are not available. We also ran experiments on an additional shopping-basket dataset, which cannot be shown here due to space constraints; see Appendix B.

4 Conclusion and Future Work

We have introduced the deep DPP model, which uses a deep feed-forward neural network to learn the DPP kernel matrix containing item embeddings. The deep DPP overcomes several limitations of the standard DPP model by allowing us to arbitrarily increase the expressive power of the model through capturing nonlinear item interactions, which can significantly improve predictive performance, while still leveraging the efficient learning, sampling, and prediction algorithms available for standard DPPs. The deep DPP architecture also allows us to easily incorporate item metadata into DPP learning. While we have empirically shown that the deep network architecture presented in this paper has favorable properties, exploration of other deep architectures is a promising direction for future work.

Acknowledgements.

We thank David Rohde for several helpful discussions.

References

  • [1] R. Affandi, E. Fox, R. Adams, and B. Taskar. Learning the parameters of determinantal point process kernels. In ICML, 2014.
  • [2] Alexei Borodin. Determinantal Point Processes. arXiv:0911.1153, 2009.
  • [3] Wei-Lun Chao, Boqing Gong, Kristen Grauman, and Fei Sha. Large-margin determinantal point processes. In Uncertainty in Artificial Intelligence (UAI), 2015.
  • [4] D. Chen. Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining. Journal of Database Marketing and Customer Strategy Management, 19(3), August 2012.
  • [5] Laurent Decreusefond, Ian Flint, Nicolas Privault, and Giovanni Luca Torrisi. Determinantal Point Processes, 2015.
  • [6] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Low-rank factorization of Determinantal Point Processes. In AAAI, 2017.
  • [7] J. Gillenwater. Approximate Inference for Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2014.
  • [8] Yifan Hu, Yehuda Koren, and Chris Volinsky. Collaborative filtering for implicit feedback datasets. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, 2008.
  • [9] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [10] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In NIPS, 2017.
  • [11] Andreas Krause, Ajit Singh, and Carlos Guestrin. Near-optimal sensor placements in Gaussian processes: theory, efficient algorithms and empirical studies. JMLR, 9:235–284, 2008.
  • [12] Maciej Kula. Metadata embeddings for user and item cold-start recommendations. arXiv preprint arXiv:1507.08439, 2015.
  • [13] A. Kulesza. Learning with Determinantal Point Processes. PhD thesis, University of Pennsylvania, 2013.
  • [14] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, 2011.
  • [15] A. Kulesza and B. Taskar. Determinantal Point Processes for machine learning, volume 5. Foundations and Trends in Machine Learning, 2012.
  • [16] Alex Kulesza and Ben Taskar. Learning determinantal point processes. In UAI, 2011.
  • [17] Frédéric Lavancier, Jesper Møller, and Ege Rubak. Determinantal Point Process models and statistical inference. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 77(4):853–877, 2015.
  • [18] Yanen Li, Jia Hu, ChengXiang Zhai, and Ye Chen. Improving one-class collaborative filtering by incorporating rich user information. In Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM ’10, 2010.
  • [19] H. Lin and J. Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), 2012.
  • [20] Zelda Mariet and Suvrit Sra. Fixed-point algorithms for learning Determinantal Point Processes. In ICML, 2015.
  • [21] Zelda Mariet and Suvrit Sra. Diversity networks. Int. Conf. on Learning Representations (ICLR), 2016.
  • [22] Takayuki Osogami, Rudy Raymond, Akshay Goel, Tomoyuki Shirai, and Takanori Maehara. Dynamic Determinantal Point Processes. In AAAI, 2018.
  • [23] Benjamin Recht, Christopher Re, Stephen Wright, and Feng Niu. Hogwild: A lock-free approach to parallelizing stochastic gradient descent. In NIPS, 2011.
  • [24] Flavian Vasile, Elena Smirnova, and Alexis Conneau. Meta-prod2vec: Product embeddings using side-information for recommendation. In RecSys, 2016.
  • [25] Pengtao Xie, Ruslan Salakhutdinov, Luntian Mou, and Eric P Xing. Deep determinantal point process for large-scale multi-label classification. In ICCV, pages 473–482, 2017.
  • [26] Cheng Zhang, Hedvig Kjellström, and Stephan Mandt. Stochastic learning on imbalanced data: Determinantal Point Processes for mini-batch diversification. CoRR, abs/1705.00607, 2017.

Appendix A Mean Percentile Rank (MPR)

We begin our definition of MPR by defining percentile rank (PR). First, given a set $J$, let $p_{i,J} = \mathcal{P}(J \cup \{i\} \mid J)$ denote the predictive probability of item $i$ given $J$. The percentile rank of an item $i$ given a set $J$ is defined as

$$\mathrm{PR}_{i,J} = \frac{\sum_{i' \in \mathcal{Y} \setminus J} \mathbb{1}\!\left(p_{i,J} \ge p_{i',J}\right)}{|\mathcal{Y} \setminus J|} \times 100\%$$

where $\mathcal{Y} \setminus J$ indicates those elements in the ground set that are not found in $J$.

MPR is then computed as

$$\mathrm{MPR} = \frac{1}{|\mathcal{T}|} \sum_{J \in \mathcal{T}} \mathrm{PR}_{i_J,\, J \setminus \{i_J\}}$$

where $\mathcal{T}$ is the set of test instances and $i_J$ is a randomly selected element in each set $J$. An MPR of 50 is equivalent to random selection; an MPR of 100 indicates that the model perfectly predicts the held-out item.
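A possible implementation of PR and MPR is sketched below (an assumed sketch, not the authors' code). It ranks each candidate item by $\det(\mathbf{L}_{J \cup \{i\}})$, i.e., the unnormalized subset probability from Eq. (1), which for a fixed basket $J$ induces the same ordering over candidates as the normalized probabilities.

```python
import numpy as np

def percentile_rank(L, basket, held_out):
    """L: (M, M) kernel; basket: observed item ids with the held-out item removed."""
    M = L.shape[0]
    in_basket = set(basket)
    candidates = [i for i in range(M) if i not in in_basket]

    def score(i):
        # log det of the kernel restricted to basket + candidate item i
        Y = basket + [i]
        sign, logdet = np.linalg.slogdet(L[np.ix_(Y, Y)])
        return logdet if sign > 0 else -np.inf

    scores = {i: score(i) for i in candidates}
    held_score = scores[held_out]
    return 100.0 * np.mean([held_score >= s for s in scores.values()])

def mean_percentile_rank(L, test_baskets, seed=0):
    rng = np.random.default_rng(seed)
    prs = []
    for basket in test_baskets:
        held_out = basket[rng.integers(len(basket))]   # randomly held-out item
        rest = [i for i in basket if i != held_out]
        prs.append(percentile_rank(L, rest, held_out))
    return float(np.mean(prs))
```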

Appendix B Experimental Results for Instacart Dataset

Instacart without metadata
Number of hidden layers | All Baskets  | Small Baskets | Medium Baskets | Large Baskets
0                       | 85.28 ± 1.69 | 85.28 ± 1.69  | 85.27 ± 1.78   | 85.64 ± 2.96
1                       | 85.83 ± 1.52 | 85.83 ± 1.52  | 86.24 ± 1.38   | 86.06 ± 2.74
2                       | 88.60 ± 0.91 | 88.60 ± 0.91  | 89.11 ± 1.37   | 88.63 ± 2.92
3                       | 88.59 ± 1.91 | 88.59 ± 1.91  | 88.95 ± 0.84   | 87.51 ± 2.28
Table 2: MPR results for the Instacart dataset, without metadata. Results show mean and standard deviation estimates obtained using bootstrapping. Bold values indicate improvement over the low-rank DPP (with 0 hidden layers) outside the standard deviation.

We perform next-item prediction experiments on another dataset composed of purchased shopping baskets: the Instacart dataset (https://www.instacart.com/datasets/grocery-shopping-2017). As with the UK retail dataset, we omit all baskets with more than 100 items, which allows us to use a low-rank factorization of the DPP (with rank $K$ set to the size of the largest retained basket) that scales well in training and prediction time, while also keeping memory consumption for model parameters at a manageable level. For the Instacart dataset, we run experiments on a random sample of 100,000 baskets from the full dataset of 3.2 million baskets, with a catalog of 35,180 items for this random sample. For all experiments, a random selection of 80% of the baskets is used for training, and the remaining 20% is used for testing.

Table 2 shows the results of our experiments on the Instacart dataset. As with the UK dataset, we compute MPR on all test baskets, and we also compute results on the test set divided into three equally-sized populations segmented according to basket size. We see that the deep DPP is able to show moderate to significant improvements in MPR over the standard low-rank DPP (with 0 hidden layers) for all baskets, medium baskets, and large baskets.