DeterminantalPointProcesses.jl
Julia package mirror.
Determinantal Point Processes (DPPs) are probabilistic models over all subsets of a ground set of N items. They have recently gained prominence in several applications that rely on "diverse" subsets. However, their applicability to large problems is still limited due to the O(N^3) complexity of core tasks such as sampling and learning. We enable efficient sampling and learning for DPPs by introducing KronDpp, a DPP model whose kernel matrix decomposes as a tensor product of multiple smaller kernel matrices. This decomposition immediately enables fast exact sampling. But contrary to what one may expect, leveraging the Kronecker product structure for speeding up DPP learning turns out to be more difficult. We overcome this challenge, and derive batch and stochastic optimization algorithms for efficiently learning the parameters of a KronDpp.
Determinantal Point Processes (DPPs) are discrete probability models over the subsets of a ground set of N items. They provide an elegant model that assigns probabilities to an exponentially large sample space while permitting tractable (polynomial-time) sampling and marginalization. They are often used to build models that balance "diversity" and quality, characteristics valuable to numerous problems in machine learning and related areas
(Kulesza and Taskar, 2012). The antecedents of DPPs lie in statistical mechanics (Macchi, 1975), but since the seminal work of (Kulesza, 2013) they have made inroads into machine learning. By now they have been applied to a variety of problems such as document and video summarization (Lin and Bilmes, 2012; Chao et al., 2015), sensor placement (Krause et al., 2008), recommender systems (Zhou et al., 2010), and object retrieval (Affandi et al., 2014). More recently, they have been used to compress fully-connected layers in neural networks (Mariet and Sra, 2016) and to provide optimal sampling procedures for the Nyström method (Li et al., 2016). The more general study of DPP properties has also garnered a significant amount of interest; see, e.g., (Hough et al., 2006; Kulesza and Taskar, 2011, 2012; Affandi et al., 2013; Decreusefond et al., 2015; Lavancier et al., 2015; Borodin, 2009; Lyons, 2003).
However, despite their elegance and tractability, widespread adoption of DPPs is impeded by the cost of basic tasks such as (exact) sampling (Hough et al., 2006; Kulesza and Taskar, 2012) and learning (Hough et al., 2006; Kulesza and Taskar, 2012; Gillenwater et al., 2014; Mariet and Sra, 2015). This cost has motivated a string of recent works on approximate sampling methods such as MCMC samplers (Kang, 2013; Li et al., 2016) or coreset-based samplers (Li et al., 2015). The task of learning a DPP from data has received less attention; the methods of (Gillenwater et al., 2014; Mariet and Sra, 2015) cost O(N^3) per iteration, which is clearly unacceptable for realistic settings. This burden is partially ameliorated in (Gartrell et al., 2016), who restrict attention to learning low-rank DPPs, though at the expense of being unable to sample subsets larger than the chosen rank.
These considerations motivate us to introduce KronDpp, a DPP model that uses Kronecker (tensor) product kernels. As a result, KronDpp enables us to learn large DPP kernels, while also permitting efficient (exact and approximate) sampling. The use of Kronecker products to scale matrix models is a popular and effective idea in several machine-learning settings (Wu et al., 2005; Martens and Grosse, 2015; Flaxman et al., 2015; Zhang et al., 2015). But as we will see, its efficient execution for DPPs turns out to be surprisingly challenging.
To make our discussion more concrete, we recall some basic facts now. Suppose we have a ground set 𝒴 = {1, …, N} of N items. A discrete DPP over 𝒴 is a probability measure P on 2^𝒴 parametrized by a positive definite matrix K (the marginal kernel) satisfying K ⪯ I, so that for any Y drawn from P and any A ⊆ 𝒴, the measure satisfies
P(A ⊆ Y) = det(K_A),    (1)
where K_A is the submatrix of K indexed by elements in A (i.e., K_A = [K_{ij}]_{i,j ∈ A}). If a DPP with marginal kernel K assigns nonzero probability to the empty set, the DPP can alternatively be parametrized by a positive definite matrix L (the DPP kernel) so that
P(Y = A) = det(L_A) / det(L + I).    (2)
A brief manipulation (see e.g., (Kulesza and Taskar, 2012, Eq. 15)) shows that when the inverse exists, L = K(I − K)^{-1}. The determinants, such as the normalization constant det(L + I) in (2), make operations over DPPs typically cost O(N^3), which is a key impediment to their scalability.
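To make (1) and (2) concrete, the following short Julia sketch (illustrative only; the toy kernel and helper names are ours, not part of any package API) evaluates both parametrizations on a tiny ground set and checks that the probabilities in (2) sum to one.

```julia
using LinearAlgebra

N = 5
X = randn(N, N)
L = Symmetric(X * X' + I)            # a small positive definite DPP kernel

# Eq. (2): P(Y = A) = det(L_A) / det(L + I)
prob(L, A) = det(L[A, A]) / det(L + I)

# Marginal kernel K = L (L + I)^{-1}, the inverse of the relation L = K (I - K)^{-1};
# Eq. (1): P(A ⊆ Y) = det(K_A)
K = L * inv(L + I)
marginal(K, A) = det(K[A, A])

println("P(Y = {1,3})  = ", prob(L, [1, 3]))
println("P({1,3} ⊆ Y)  = ", marginal(K, [1, 3]))

# Sanity check: Eq. (2) defines a probability distribution over all 2^N subsets.
total = sum(prob(L, [i for i in 1:N if (mask >> (i - 1)) & 1 == 1]) for mask in 0:(2^N - 1))
println("sum over all subsets ≈ ", total)   # ≈ 1.0
```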
Therefore, if we consider a class of DPP kernels whose structure makes it easy to compute determinants, we should be able to scale up DPPs. An alternative approach towards scalability is to restrict the size of the subsets, as done in k-DPPs (Kulesza and Taskar, 2011) or when using rank-k DPP kernels (Gartrell et al., 2016), where k ≪ N. Both these approaches still require costly preprocessing for exact sampling; another caveat is that they limit the DPP model by assigning zero probability to sets of cardinality greater than k.
In contrast, KronDpp uses a kernel matrix of the form L = L_1 ⊗ L_2 ⊗ ⋯ ⊗ L_m, where each subkernel L_i is a smaller positive definite matrix. This decomposition has two key advantages: (i) it significantly lowers the number of parameters required to specify the DPP, from N^2 to roughly m·N^{2/m} (assuming the subkernels are roughly the same size); and (ii) it enables fast sampling and learning.
For ease of exposition, we describe specific details of KronDpp for m = 2; as will become clear from the analysis, typically the special cases m = 2 and m = 3 should suffice to obtain low-complexity sampling and learning algorithms.
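As a rough illustration of point (i), here is a minimal Julia sketch (sizes and names are arbitrary, and not the package's API) of an m = 2 KronDpp kernel stored as two small factors; the full N × N kernel is materialized only because N is tiny here.

```julia
using LinearAlgebra

n1, n2 = 30, 20                       # subkernel sizes; ground set size N = n1 * n2
A1, A2 = randn(n1, n1), randn(n2, n2)
L1 = Symmetric(A1 * A1' + I)          # positive definite subkernels
L2 = Symmetric(A2 * A2' + I)

N = n1 * n2
println("entries of a full kernel: ", N^2, "  vs. Kronecker factors: ", n1^2 + n2^2)

# For small N we can form L = L1 ⊗ L2 and confirm it is positive definite;
# the algorithms described below never need to do this explicitly.
L = kron(L1, L2)
@assert isposdef(Symmetric(Matrix(L)))
```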
Our main contribution is the KronDpp model along with efficient algorithms for sampling from it and learning a Kronecker-factored kernel. Specifically, inspired by the algorithm of (Mariet and Sra, 2015), we develop KrKPicard (Kronecker-Kernel Picard), a block-coordinate ascent procedure that generates a sequence of Kronecker-factored estimates of the DPP kernel while ensuring monotonic progress on its (difficult, nonconvex) objective function. More importantly, we show how to implement KrKPicard to run in O(N^2) time when implemented as a batch method, and in O(N^{3/2}) time and O(N) space when implemented as a stochastic method. As alluded to above, unlike many other uses of Kronecker models, KronDpp does not admit trivial scaling up, largely due to the extensive dependence of DPPs on arbitrary submatrices of the DPP kernel. An interesting theoretical nugget that arises from our analysis is the combinatorial problem that we call subset clustering, a problem whose (even approximate) solution can lead to further speedups of our algorithms.
We begin by recalling basic properties of Kronecker products needed in our analysis; we omit proofs of these well-known results for brevity. The Kronecker (tensor) product of A ∈ R^{p×q} with B ∈ R^{r×s} is defined as the pr × qs block matrix whose (i, j)-th block is a_{ij} B.
We denote this (i, j)-th block by M_{(i,j)} for any valid pair (i, j), and extend the notation to non-Kronecker-product matrices to indicate the r × s submatrix at position (i, j).
Let M, N, P, Q be matrices of sizes such that the products MP and NQ are well-defined. Then:
(i) if M ⪰ 0 and N ⪰ 0, then M ⊗ N ⪰ 0;
(ii) if M and N are invertible then so is M ⊗ N, with (M ⊗ N)^{-1} = M^{-1} ⊗ N^{-1};
(iii) (M ⊗ N)(P ⊗ Q) = MP ⊗ NQ.
An important consequence of Prop. 2.1 is the following corollary: if L_1 = P_1 Λ_1 P_1^⊤ and L_2 = P_2 Λ_2 P_2^⊤ are eigendecompositions of L_1 and L_2, then (P_1 ⊗ P_2)(Λ_1 ⊗ Λ_2)(P_1 ⊗ P_2)^⊤ is an eigendecomposition of L_1 ⊗ L_2; in particular, the eigenvalues of L_1 ⊗ L_2 are the pairwise products of the eigenvalues of L_1 and L_2.
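These identities are easy to verify numerically; a small Julia check (toy sizes, illustrative only):

```julia
using LinearAlgebra

n1, n2 = 6, 5
A1, A2 = randn(n1, n1), randn(n2, n2)
L1, L2 = Symmetric(A1 * A1' + I), Symmetric(A2 * A2' + I)

# (M ⊗ N)(P ⊗ Q) = MP ⊗ NQ
M, P = randn(n1, n1), randn(n1, n1)
Nm, Q = randn(n2, n2), randn(n2, n2)
@assert kron(M, Nm) * kron(P, Q) ≈ kron(M * P, Nm * Q)

# (L1 ⊗ L2)^{-1} = L1^{-1} ⊗ L2^{-1}
@assert inv(Matrix(kron(L1, L2))) ≈ kron(inv(L1), inv(L2))

# Eigenvalues of L1 ⊗ L2 are pairwise products; eigenvectors are Kronecker products.
E1, E2 = eigen(L1), eigen(L2)
λ = kron(E1.values, E2.values)
V = kron(E1.vectors, E2.vectors)
@assert Matrix(kron(L1, L2)) * V ≈ V * Diagonal(λ)
```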
We will also need the notion of partial trace operators, which are perhaps less well-known:
Let M ∈ R^{N_1 N_2 × N_1 N_2}. The partial traces Tr_1(M) ∈ R^{N_1 × N_1} and Tr_2(M) ∈ R^{N_2 × N_2} are defined as follows:
[Tr_1(M)]_{ij} = Tr(M_{(i,j)}),    Tr_2(M) = Σ_{i=1}^{N_1} M_{(i,i)}.
The action of partial traces is easy to visualize: indeed, Tr_1(A ⊗ B) = Tr(B)·A and Tr_2(A ⊗ B) = Tr(A)·B. For us, the most important property of partial trace operators is their positivity.
Tr_1 and Tr_2 are positive operators, i.e., for M ⪰ 0, Tr_1(M) ⪰ 0 and Tr_2(M) ⪰ 0.
Please refer to (Bhatia, 2007, Chap. 4). ∎
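A direct Julia sketch of the two partial-trace operators (function names are ours), together with a check of the identities Tr_1(A ⊗ B) = Tr(B)·A and Tr_2(A ⊗ B) = Tr(A)·B:

```julia
using LinearAlgebra

# Block (i, j) of M, where M is viewed as an N1 × N1 grid of N2 × N2 blocks.
block(M, i, j, N2) = M[(i-1)*N2+1:i*N2, (j-1)*N2+1:j*N2]

# [Tr1(M)]_{ij} = Tr(M_{(i,j)})            (an N1 × N1 matrix)
partial_trace1(M, N1, N2) = [tr(block(M, i, j, N2)) for i in 1:N1, j in 1:N1]

# Tr2(M) = Σ_i M_{(i,i)}                   (an N2 × N2 matrix)
partial_trace2(M, N1, N2) = sum(block(M, i, i, N2) for i in 1:N1)

N1, N2 = 4, 3
A, B = randn(N1, N1), randn(N2, N2)
M = kron(A, B)
@assert partial_trace1(M, N1, N2) ≈ tr(B) * A
@assert partial_trace2(M, N1, N2) ≈ tr(A) * B
```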
In this section, we consider the key difficult task for KronDpps: learning a Kronecker-product kernel matrix L = L_1 ⊗ L_2 from observed subsets Y_1, …, Y_n. Using the definition (2) of the DPP probability, maximum-likelihood learning of a DPP with kernel L results in the optimization problem:
max_{L ≻ 0}  φ(L) := (1/n) Σ_{i=1}^{n} log det(L_{Y_i}) − log det(L + I).    (3)
This problem is nonconvex and conjectured to be NP-hard (Kulesza, 2013, Conjecture 4.1). Moreover, the constraint L ≻ 0 is nontrivial to handle. Writing U_i for the indicator matrix of Y_i, of size N × |Y_i|, so that L_{Y_i} = U_i^⊤ L U_i, the gradient of φ is easily seen to be
∇φ(L) = Δ − (L + I)^{-1},   where   Δ := (1/n) Σ_{i=1}^{n} U_i L_{Y_i}^{-1} U_i^⊤.    (4)
In (Mariet and Sra, 2015), the authors derived an iterative method ("the Picard iteration") for computing an L that satisfies ∇φ(L) = 0 by running the simple iteration
L ← L + L (Δ − (L + I)^{-1}) L.    (5)
Moreover, iteration (5) is guaranteed to monotonically increase the log-likelihood (Mariet and Sra, 2015). But these benefits accrue at a cost of O(N^3) per iteration, and furthermore a direct application of (5) cannot guarantee the Kronecker structure required by KronDpp.
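For reference, here is a toy Julia sketch of the full-kernel Picard step as written in (3)–(5); it is an illustration of the idea only, not the reference implementation, and all names are ours.

```julia
using LinearAlgebra, Statistics, Random

# Log-likelihood (3) for a full DPP kernel and training subsets Y_1, ..., Y_n.
loglik(L, samples) = mean(logdet(L[Y, Y]) for Y in samples) - logdet(L + I)

# Δ = (1/n) Σ_i U_i L_{Y_i}^{-1} U_i^T, assembled without forming the U_i explicitly.
function delta(L, samples)
    Δ = zeros(size(L))
    for Y in samples
        Δ[Y, Y] .+= inv(L[Y, Y])
    end
    return Δ ./ length(samples)
end

# One Picard step (5): L ← L + L (Δ - (I + L)^{-1}) L, at O(N^3) cost.
picard_step(L, samples) = L + L * (delta(L, samples) - inv(L + I)) * L

# Toy usage on random data.
Random.seed!(0)
N = 30
A0 = randn(N, N)
L = Matrix(Symmetric(A0 * A0' / N + I))
samples = [sort!(randperm(N)[1:5]) for _ in 1:20]
for k in 1:5
    global L = picard_step(L, samples)
    println("iteration $k: log-likelihood ≈ ", loglik(L, samples))
end
```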
Our aim is to obtain an efficient algorithm to (locally) optimize (3). Beyond its nonconvexity, the Kronecker structure L = L_1 ⊗ L_2 imposes another constraint. As in (Mariet and Sra, 2015), we first rewrite φ as a function of S = L^{-1}, and rearrange terms to write it as
φ(S) = f(S) + g(S),   where   f(S) = (1/n) Σ_i log det(U_i^⊤ S^{-1} U_i) + log det(S)   and   g(S) = −log det(I + S).    (6)
It is easy to see that g is convex, while a short argument shows that f is concave (Mariet and Sra, 2015). An appeal to the convex-concave procedure (Yuille and Rangarajan, 2002) then shows that updating S by solving S_{k+1} = argmax_S f(S) + ⟨∇g(S_k), S⟩ (where ⟨A, B⟩ = Tr(A^⊤ B)), which is what (5) does (Mariet and Sra, 2015, Thm. 2.2), is guaranteed to monotonically increase the log-likelihood.
But for KronDpp this idea does not apply so easily: due to the constraint S = S_1 ⊗ S_2 (writing S_i = L_i^{-1}), the function (S_1, S_2) ↦ g(S_1 ⊗ S_2)
fails to be convex, precluding an easy generalization. Nevertheless, for fixed S_1 or S_2, the functions
S_1 ↦ f(S_1 ⊗ S_2) and S_1 ↦ g(S_1 ⊗ S_2) (and their counterparts in S_2) are once again concave or convex. Indeed, the map S_1 ↦ S_1 ⊗ S_2 is linear and f is concave, so S_1 ↦ f(S_1 ⊗ S_2) is also concave; similarly, S_2 ↦ f(S_1 ⊗ S_2) is seen to be concave, and S_1 ↦ g(S_1 ⊗ S_2) and S_2 ↦ g(S_1 ⊗ S_2) are convex. Hence, by generalizing the arguments of (Yuille and Rangarajan, 2002, Thm. 2) to our "block-coordinate" setting, updating S_1 and S_2 via
S_1 ← argmax_{S_1 ≻ 0} f(S_1 ⊗ S_2) + ⟨∇g(S_1^{(k)} ⊗ S_2), S_1 ⊗ S_2⟩,   and similarly for S_2,    (7)
should increase the log-likelihood at each iteration. We prove below that this is indeed the case, and that updating as per (7) ensures positive definiteness of the iterates as well as monotonic ascent.
In order to show the positive definiteness of the solutions to (7), we first derive their closed form.
For L_2 (resp. L_1) fixed, the solutions to (7) are given in closed form by expressions involving the partial trace operators Tr_1 (resp. Tr_2). Moreover, these solutions are positive definite.
We are now ready to establish that these updates ensure monotonic ascent in the log-likelihood.
Starting with L_1 ≻ 0 and L_2 ≻ 0, updating according to (7) generates positive definite iterates L_1 and L_2, and the sequence of log-likelihoods is nondecreasing.
Updating according to (7) generates positive definite matrices S_1 and S_2, and hence positive definite subkernels L_1 and L_2. Moreover, due to the convexity of g and the concavity of f, for positive definite matrices X and Y,
g(X) ≥ g(Y) + ⟨∇g(Y), X − Y⟩.
Hence, φ(X) = f(X) + g(X) ≥ f(X) + ⟨∇g(Y), X − Y⟩ + g(Y).
Thus, if S_1' verifies (7) with S_2 held fixed, by setting X = S_1' ⊗ S_2 and Y = S_1 ⊗ S_2 we have
φ(S_1' ⊗ S_2) ≥ f(S_1' ⊗ S_2) + ⟨∇g(S_1 ⊗ S_2), (S_1' − S_1) ⊗ S_2⟩ + g(S_1 ⊗ S_2) ≥ f(S_1 ⊗ S_2) + g(S_1 ⊗ S_2) = φ(S_1 ⊗ S_2),
where the last inequality holds because S_1' maximizes the objective in (7).
The same reasoning holds for S_2, which proves the theorem. ∎
Since the closed-form solution for S_1 translates directly into an update of L_1 = S_1^{-1} (and similarly for L_2), updating as in (7) is equivalent to updating the subkernels L_1 and L_2 themselves.
Generalization. We can generalize the updates to take an additional step-size parameter a, which scales the correction applied to each subkernel at every iteration.
Experimentally, a > 1 (as long as the updates remain positive definite) can provide faster convergence, although the monotonicity of the log-likelihood is no longer guaranteed. We found experimentally that the range of admissible step sizes is larger than for Picard, but decreases as N grows larger.
The arguments above easily generalize to the multi-block case. Thus, when learning a kernel of the form L = L_1 ⊗ ⋯ ⊗ L_m, writing E_{ij} for the matrix with a 1 in position (i, j) and zeros elsewhere, we update each subkernel in turn in the same block-coordinate fashion.
From the above updates it is not transparent whether the Kronecker product saves us any computation. In particular, it is not clear whether the updates can be implemented to run faster than O(N^3). We show in the next section how to implement these updates efficiently.
From Theorem 3.2, we obtain Algorithm 1 (which is different from the Picard iteration in (Mariet and Sra, 2015), because it operates alternatingly on each subkernel). It is important to note that a further speedup to Algorithm 1 can be obtained by performing stochastic updates: instead of computing the full gradient of the log-likelihood, we perform our updates using only one subset (or a small minibatch of subsets) at each step instead of iterating over the entire training set; this uses the stochastic gradient U_i L_{Y_i}^{-1} U_i^⊤ − (L + I)^{-1}.
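Sticking with the toy helpers sketched earlier, the stochastic variant simply replaces Δ in (4) by the contribution of a single training subset (or a minibatch); a sketch:

```julia
# Per-sample surrogate for Δ: only the |Y| × |Y| block indexed by Y is nonzero,
# so it can be stored and manipulated sparsely (kept dense here for brevity).
function stochastic_delta(L, Y)
    Δ = zeros(size(L))
    Δ[Y, Y] .= inv(L[Y, Y])
    return Δ
end

# Stochastic gradient estimate of (4) for one subset Y; in KrKPicard the Kronecker
# structure is what keeps the (I + L)^{-1}-dependent part cheap.
stochastic_gradient(L, Y) = stochastic_delta(L, Y) - inv(L + I)
```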
The crucial strength of Algorithm 1 lies in the following result:
For L = L_1 ⊗ L_2, the updates in Algorithm 1 can be computed without ever forming the full N × N kernel, in time and space that depend only on the subkernel sizes N_1 and N_2, the number of training subsets n, and κ, the size of the largest training subset. Furthermore, stochastic updates can be computed in even less time and space.
Indeed, by leveraging the properties of the Kronecker product, the updates can be obtained without explicitly forming L or (L + I)^{-1}. This result is nontrivial: the two components of the gradient, Δ and (L + I)^{-1}, must be considered separately for computational efficiency. The proof is provided in App. B. However, it seems that considering more than two subkernels does not lead to further speedups.
If N_1 ≈ N_2 ≈ √N, these complexities become:
for nonstochastic updates: O(N^2) time, O(N^2) space,
for stochastic updates: O(N^{3/2}) time, O(N) space.
This is a marked improvement over (Mariet and Sra, 2015), which runs in O(N^2) space and O(N^3) time (nonstochastic) or O(N^2) time (stochastic); Algorithm 1 also provides faster stochastic updates than (Gartrell et al., 2016). However, one may wonder whether learning the subkernels by alternating updates causes the log-likelihood to converge to a suboptimal limit. The next section discusses how to jointly update L_1 and L_2.
We also analyzed the possibility of updating L_1 and L_2 jointly: we update the full kernel L and then recover the Kronecker structure of the kernel by defining the updates L_1' and L_2' such that:
(L_1', L_2') ∈ argmin_{A, B} ‖ L_new − A ⊗ B ‖_F,   where L_new denotes the updated full kernel.    (8)
We show in appendix C that such solutions exist and can be computed from the first singular value and singular vectors of a suitably rearranged version of L_new. Note however that in this case there is no guaranteed increase in log-likelihood. The pseudocode for the related algorithm (JointPicard) is given in appendix C.1. An analysis similar to the proof of Thm. 3.3 shows that the updates can be obtained efficiently.
Although KronDpps have tractable learning algorithms, the memory requirements remain high for nonstochastic updates, as the matrix Δ needs to be stored, requiring up to O(N^2) memory. However, if the training set can be subdivided such that
{Y_1, …, Y_n} = C_1 ∪ ⋯ ∪ C_p   with   |∪_{Y ∈ C_j} Y| ≤ c   for each part C_j,    (9)
then Δ can be decomposed as Δ = Σ_j Δ_j with Δ_j = (1/n) Σ_{Y_i ∈ C_j} U_i L_{Y_i}^{-1} U_i^⊤. Due to the bound in Eq. 9, each Δ_j will be sparse, with only O(c^2) nonzero coefficients. We can then store each Δ_j with minimal storage and update L_1 and L_2 at a correspondingly reduced cost in time and space.
Determining the existence of such a partition of size p is a variant of the NP-hard Subset-Union Knapsack Problem (SUKP) (Goldschmidt et al., 1994) with p knapsacks and where the value of each item (i.e., each Y_i) is equal to 1: a solution to SUKP of value n with p knapsacks is equivalent to a solution to (9). However, an approximate partition can also be constructed simply via a greedy algorithm.
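The sketch below only illustrates the flavor of such a greedy heuristic in Julia: each training subset joins the group whose element-union it enlarges the least, subject to a budget on the union size; the budget-based criterion and the function name are our assumptions standing in for condition (9), not the paper's exact procedure.

```julia
# Greedy subset clustering (illustrative). Groups whose element-unions stay
# small keep the corresponding partial sums Δ_j of Δ sparse.
function greedy_partition(samples::Vector{Vector{Int}}, budget::Int)
    parts  = Vector{Vector{Vector{Int}}}()   # groups of training subsets
    unions = Vector{Set{Int}}()              # union of elements in each group
    for Y in samples
        best, best_growth = 0, typemax(Int)
        for (k, U) in enumerate(unions)
            growth = count(i -> !(i in U), Y)
            if length(U) + growth <= budget && growth < best_growth
                best, best_growth = k, growth
            end
        end
        if best == 0                          # no existing group fits: open a new one
            push!(parts, [Y]); push!(unions, Set(Y))
        else
            push!(parts[best], Y); union!(unions[best], Y)
        end
    end
    return parts
end
```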
Sampling exactly from a full DPP kernel (see Alg. 2 and (Kulesza and Taskar, 2012)) costs O(N^3 + N k^3), where k is the size of the sampled subset. The bulk of the computation lies in the initial eigendecomposition of L, at O(N^3); the orthonormalizations cost O(N k^3). Although the eigendecomposition need only happen once for many rounds of sampling, exact sampling is nonetheless intractable in practice for large N.
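For concreteness, here is a compact Julia sketch of that exact sampler, written from the standard description in (Hough et al., 2006; Kulesza and Taskar, 2012); the function name is ours, and it takes the eigenvalues and eigenvectors of L as input, which is exactly where the Kronecker structure helps below.

```julia
using LinearAlgebra, Random

# Exact sampling from a DPP with kernel eigendecomposition (λ, V).
function sample_dpp(λ::Vector{Float64}, V::Matrix{Float64}; rng = Random.default_rng())
    # Phase 1: keep eigenvector i with probability λ_i / (1 + λ_i).
    keep = [rand(rng) < λi / (1 + λi) for λi in λ]
    W = V[:, keep]
    Y = Int[]
    # Phase 2: pick items one by one, projecting out the chosen coordinate each time.
    while size(W, 2) > 0
        p = vec(sum(abs2, W; dims = 2)) ./ size(W, 2)            # item probabilities
        i = something(findfirst(cumsum(p) .>= rand(rng)), length(p))
        push!(Y, i)
        j = argmax(abs.(W[i, :]))                                # pivot column
        v = W[:, j] / W[i, j]
        rest = setdiff(1:size(W, 2), j)
        W = W[:, rest] .- v * W[i, rest]'                        # zero out coordinate i
        if size(W, 2) > 0
            W = Matrix(qr(W).Q)                                  # re-orthonormalize basis
        end
    end
    return sort!(Y)
end
```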
It follows from Prop. 2.2 that for KronDpps with L = L_1 ⊗ L_2, the eigenvalues of L can be obtained in O(N_1^3 + N_2^3) operations, and each eigenvector, as a Kronecker product of factor eigenvectors, in O(N) operations. For N_1 ≈ N_2 ≈ √N, exact sampling thus only costs O(N^{3/2} + N k^3). If L = L_1 ⊗ L_2 ⊗ L_3, the same reasoning shows that exact sampling becomes linear in N, requiring only O(N k^3) operations.
One can also resort to MCMC sampling; for instance such a sampler was considered in (Kang, 2013) (though with an incorrect mixing-time analysis). The results of (Li et al., 2016) hold only for k-DPPs, but suggest that the mixing time of their MCMC sampler for full DPPs may be impractically large. Nevertheless, if faster MCMC samplers are developed, they should also be able to profit from the Kronecker-product structure offered by KronDpp.
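Returning to exact sampling: a short Julia sketch (reusing the hypothetical `sample_dpp` function above) shows how the eigen-pairs of a KronDpp come directly from the small factors; the full eigenvector matrix is materialized here only because the toy N is tiny.

```julia
using LinearAlgebra

n1, n2 = 8, 6
A1, A2 = randn(n1, n1), randn(n2, n2)
L1, L2 = Symmetric(A1 * A1'), Symmetric(A2 * A2')

E1, E2 = eigen(L1), eigen(L2)
λ = kron(E1.values, E2.values)       # eigenvalues of L1 ⊗ L2, no N × N eigendecomposition
V = kron(E1.vectors, E2.vectors)     # eigenvectors of L1 ⊗ L2
Y = sample_dpp(λ, V)                 # `sample_dpp` from the sketch above
println("sampled subset of the N = $(n1 * n2) items: ", Y)
```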
In order to validate our learning algorithm, we compared KrKPicard to JointPicard and to the Picard iteration (Picard) on multiple real and synthetic datasets. (All experiments were repeated 5 times and averaged, using MATLAB on a Linux Mint system with 16 GB of RAM and an i7-4710HQ CPU @ 2.50 GHz.)
All three algorithms were used to learn kernels from synthetic data drawn from a "true" kernel. The subkernels were initialized as random positive definite matrices of the form X X^⊤, with the coefficients of X drawn uniformly at random; for Picard, the full kernel was initialized with the Kronecker product of the initial subkernels.
(Figure 1: synthetic-data experiments; the thin dotted lines indicate the standard deviation from the mean. Training data was generated by sampling 100 subsets from the true kernel, with sizes uniformly distributed between 10 and 190.)
To evaluate KrKPicard on matrices too large to fit in memory and with large N, we drew samples from a low-rank kernel and learned the kernel stochastically (only KrKPicard could be run, due to the memory requirements of the other methods); the likelihood improves drastically in only two steps (Fig. 1(c)).
As shown in Figures 1(a) and 1(b), KrKPicard converges significantly faster than Picard, especially for large values of N. However, although JointPicard also increases the log-likelihood at each iteration, it converges much more slowly and has a high standard deviation, whereas the standard deviations for Picard and KrKPicard are barely noticeable. For these reasons, we drop the comparison to JointPicard in the subsequent experiments.
We compared KrKPicard to Picard and EM (Gillenwater et al., 2014) on the baby registry dataset (described in depth in (Gillenwater et al., 2014)), which has also been used to evaluate other DPP learning algorithms (Gillenwater et al., 2014; Mariet and Sra, 2015; Gartrell et al., 2016). The dataset contains 17 categories of baby-related products obtained from Amazon. We learned kernels for the 6 largest categories; in this case, Picard is sufficiently efficient to be preferred to KrKPicard, and this comparison serves only to evaluate the quality of the final kernel estimates.
The initial marginal kernel for EM was sampled from a Wishart distribution with an identity covariance matrix and then rescaled so that it defines a valid marginal kernel; for Picard, the initial L was obtained from this marginal kernel via L = K(I − K)^{-1}; for KrKPicard, L_1 and L_2 were chosen (as in JointPicard) by minimizing the Frobenius distance to that initial kernel. Convergence was determined when the change in the objective dipped below a threshold; as one EM iteration takes longer than one Picard iteration but increases the likelihood more, we set different thresholds for EM and for the Picard-based methods.
The final log-likelihoods are shown in Table 1; we set the step sizes to their largest admissible values. Table 1 shows that KrKPicard obtains comparable, albeit slightly worse, log-likelihoods than Picard and EM, which confirms that for tractable N the better modeling capability of full kernels makes them preferable to KronDpps.


Finally, to evaluate KrKPicard on large matrices of real-world data, we train it on data from the GENES dataset (Batmanghelich et al., 2014), which has also been used to evaluate DPPs in (Li et al., 2015; Batmanghelich et al., 2014). This dataset consists of 10,000 genes, each represented by 331 features corresponding to the distance of a gene to hubs in the BioGRID gene interaction network.
Table 2: average runtimes and log-likelihood increase after the first iteration (GENES dataset).

                            Picard            KrKPicard       KrKPicard (stochastic)
Average runtime             161.5 ± 17.7 s    8.9 ± 0.2 s     1.2 ± 0.02 s
NLL increase (1st iter.)
We construct a ground-truth Gaussian DPP kernel on the GENES dataset and use it to obtain 100 training samples with sizes uniformly distributed between 50 and 200 items. As in the synthetic experiments, we initialized KrKPicard's subkernels as X X^⊤ for random matrices X; for Picard, we set the initial kernel to the Kronecker product of these subkernels.
Figure 2 shows the performance of both algorithms. As with the synthetic experiments, KrKPicard converges much faster; stochastic updates increase its performance even more, as shown in Fig. 2(b). Average runtimes and speedups are given in Table 2: KrKPicard runs almost an order of magnitude faster than Picard, and stochastic updates are more than two orders of magnitude faster, while providing slightly larger initial increases to the log-likelihood.
We introduced KronDpps, a variant of DPPs whose kernels are structured as the Kronecker product of smaller matrices, and showed that typical operations over DPPs, such as sampling and learning the kernel from data, can be made efficient for KronDpps on ground sets of previously intractable size.
By carefully leveraging the properties of the Kronecker product, we derived, for L = L_1 ⊗ L_2, a low-complexity algorithm to learn the kernel from data which guarantees positive definite iterates and a monotonic increase of the log-likelihood, and runs in O(N^2) time. This algorithm provides even more significant speedups and memory gains in the stochastic case, requiring only O(N^{3/2}) time and O(N) space. Experiments on synthetic and real data showed that KronDpps can be learned efficiently on sets large enough that the full kernel does not fit in memory.
While discussing learning the kernel, we showed that L_1 and L_2 cannot be updated simultaneously in a CCCP-style iteration, since the objective is not jointly convex-concave over (L_1, L_2). However, it can be shown that the problem possesses geodesic convexity over the Riemannian manifold of positive definite matrices, which suggests that deriving an iteration that takes advantage of the intrinsic geometry of the problem may be a viable line of future work.
KronDpps also enable fast sampling, in O(N^{3/2} + N k^3) operations when using two subkernels and in O(N k^3) operations when using three; this allows for exact sampling at a cost comparable to, or even better than, that of previous algorithms for approximate sampling. However, as the dependence on N improves, the sampled subset size k becomes the limiting factor, due to the O(N k^3) cost of sampling and the dependence of learning on the largest training subset size κ. A necessary line of future work to allow for truly scalable DPPs is thus to overcome this computational bottleneck.
We use 'vec' to denote the operator that stacks the columns of a matrix to form a vector; conversely, its inverse takes a vector with p·q coefficients and returns a p × q matrix.
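In Julia these two operators correspond directly to `vec` and `reshape`:

```julia
M = collect(reshape(1:6, 2, 3))   # a 2 × 3 matrix
v = vec(M)                        # stack the columns: [1, 2, 3, 4, 5, 6]
@assert reshape(v, 2, 3) == M
```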
We denote by E_{ij} the matrix with all zeros except for a 1 at position (i, j), its size being clear from context. We wish to solve
(10) 
It follows from the fact that
that and . Moreover, we know that
The Jacobian of is given by . Hence,
The last equivalence is simply the result of index manipulations. Thus, we have
The updates to L_1 and L_2 are obtained efficiently through different methods; hence, the proof of Thm. 3.3 is split into two parts. We write L = L_1 ⊗ L_2 throughout, so that Δ and (L + I)^{-1} are the two components of the gradient (4). Recall that Δ = (1/n) Σ_i U_i L_{Y_i}^{-1} U_i^⊤.
We first wish to compute the update to L_1 efficiently. The Δ-dependent matrix in the update can be computed simply by precomputing the required blocks and then evaluating all the necessary traces. When doing stochastic updates, for which Δ is sparse with only O(κ^2) nonzero coefficients, this computation is correspondingly cheaper.
By diagonalizing L_1 = P_1 D_1 P_1^⊤ and L_2 = P_2 D_2 P_2^⊤, we have (L + I)^{-1} = (P_1 ⊗ P_2)(D_1 ⊗ D_2 + I)^{-1}(P_1 ⊗ P_2)^⊤; the factors P_1, P_2, D_1, and D_2 can all be obtained in O(N_1^3 + N_2^3) operations as a consequence of Prop. 2.1. Intermediate quantities based on this decomposition can then be precomputed, after which the required partial traces are computable efficiently.
Overall, the update to L_1 can be computed efficiently, and even more cheaply if the updates are stochastic. Moreover, if Δ is sparse with only O(κ^2) nonzero coefficients (as for stochastic updates), it can be stored compactly, reducing the overall memory cost.
We now wish to compute the update to L_2 efficiently.