1. Introduction
Determinantal Point Processes (DPPs) enable a form of subsampling that generalises sampling without replacement: from a set of items 𝒳, we draw a random subset 𝒮 ⊆ 𝒳 such that 𝒮 preserves some of the diversity in 𝒳. Here we focus solely on discrete DPPs, where 𝒳 has N elements. A generic algorithm for exact sampling from discrete DPPs was given in [hough_determinantal_2006], and popularised by [kulesza_determinantal_2012]. Implemented naively, this algorithm has cost O(N k³), where k is the number of samples. The point of this note is to derive and describe a simpler algorithm with cost O(N k²). We make no great claim to novelty, as other algorithms with the same asymptotic cost exist, but the one we give is very easily stated and trivial to implement.
1.1. Notations
Sets are in uppercase calligraphic letters: 𝒮, 𝒥, 𝒳. Vectors are in lowercase bold letters: x, p, f. The i-th entry of vector x is written either x_i or x(i). For instance, δ_j is the vector such that: δ_j(j) = 1 and δ_j(i) = 0 for all i ≠ j. Matrices are in uppercase letters: A, K, L. For instance, I_N is the identity matrix in dimension N. If the dimension is not specified, it can be guessed from the context. The (i,j)-th element of matrix A is written A_ij and its j-th column is written A_{:,j}. A_{𝒮,𝒜} is the restriction of matrix A to the rows (resp. columns) indexed by the elements of 𝒮 (resp. 𝒜). We write A_𝒮 as a shorthand for A_{𝒮,𝒮}. Moreover, [n] stands for the set of the first n integers {1, …, n}. Finally, the notation det(A) stands for the determinant of matrix A.

1.2. Definitions
We give here a definition of DPPs via L-ensembles. Equivalent formulations are based on the marginal kernel and specify marginal probabilities (see e.g. [kulesza_determinantal_2012] for details).

Definition 1.1 (Determinantal Point Process).
Consider a point process, i.e., a process that randomly draws an element 𝒮 ⊆ [N]. It is determinantal if there exists a positive semidefinite matrix L ∈ ℝ^{N×N} such that the probability of sampling S is:

P(𝒮 = S) = det(L_S) / det(I + L).

L is called the L-ensemble of the DPP. L being positive semidefinite, it is diagonalisable in:

(1) L = U Λ Uᵀ,

with U = [u_1 | … | u_N] its set of orthonormal eigenvectors and Λ = diag(λ_1, …, λ_N) the diagonal matrix of sorted eigenvalues: 0 ≤ λ_1 ≤ … ≤ λ_N.
The number of samples |𝒮| of a DPP is random and is distributed according to a sum of N Bernoulli variables of parameters λ_n / (1 + λ_n) [kulesza_determinantal_2012]. The expected number of samples is thus

E(|𝒮|) = Σ_{n=1}^{N} λ_n / (1 + λ_n),

and the variance of the number of samples is

Var(|𝒮|) = Σ_{n=1}^{N} λ_n / (1 + λ_n)².
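As a quick numerical sanity check (a sketch of ours, not from the paper; all variable names are hypothetical), the two moments above can be compared against Monte Carlo draws of the underlying Bernoulli variables:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = rng.uniform(0.0, 5.0, size=100)   # eigenvalues of a hypothetical L-ensemble
q = lam / (1.0 + lam)                   # Bernoulli parameters

mean_size = q.sum()                     # expected number of samples
var_size = (q * (1.0 - q)).sum()        # variance: sum of lam_n / (1 + lam_n)^2

# Monte Carlo check of the sum-of-Bernoullis representation
draws = (rng.random((20000, lam.size)) < q).sum(axis=1)
assert abs(draws.mean() - mean_size) < 0.5
assert abs(draws.var() - var_size) < 1.0
```

Note that q(1 − q) = λ/(1 + λ) × 1/(1 + λ) recovers the λ_n/(1 + λ_n)² terms of the variance formula.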
In many cases, it is preferable to constrain the DPP to output a fixed number of samples m. This leads to m-DPPs:

Definition 1.2 (m-DPP [kulesza_determinantal_2012]).
Consider a point process that randomly draws an element 𝒮 ⊆ [N]. This process is an m-DPP with L-ensemble L if:

P(𝒮 = S) = 0 whenever |S| ≠ m;

P(𝒮 = S) = c · det(L_S) whenever |S| = m, with c the constant s.t. Σ_{|S| = m} c · det(L_S) = 1.
1.3. The standard DPP sampling algorithm
The standard algorithm to sample a DPP from an L-ensemble L given its eigendecomposition may be decomposed in two steps [hough_determinantal_2006]:

i/ Sample eigenvectors. Draw N Bernoulli variables with parameters λ_n / (1 + λ_n): for n = 1, …, N, add n to the set 𝒥 of sampled indices with probability λ_n / (1 + λ_n). We generically denote by k the number of elements in 𝒥. Note that the expected value of k is E(|𝒮|).

ii/ Run alg. 1 to sample a DPP with projective L-ensemble K = V Vᵀ, where V ∈ ℝ^{N×k} concatenates all eigenvectors u_n such that n ∈ 𝒥. Note that such a DPP outputs exactly k samples.
For the proof that the combination of these two steps samples a DPP with L-ensemble L, we refer the reader to the papers [hough_determinantal_2006, kulesza_determinantal_2012].
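The eigenvector-sampling step can be sketched in a few lines of numpy (the function name and interface are ours, not the paper's):

```python
import numpy as np

def sample_dpp_eigvecs(L, rng):
    """First step of the two-step sampler: keep eigenvector n of L with
    probability lambda_n / (1 + lambda_n). Returns V, whose columns
    span the projective kernel K = V V^T used in the second step."""
    lam, U = np.linalg.eigh(L)            # L = U diag(lam) U^T
    lam = np.clip(lam, 0.0, None)         # guard against tiny negative eigenvalues
    keep = rng.random(len(lam)) < lam / (1.0 + lam)
    return U[:, keep]

# Usage on a random positive semidefinite L-ensemble
rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
V = sample_dpp_eigvecs(A @ A.T, rng)      # V has orthonormal columns
```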
The expensive step in alg. 1 is the orthogonalisation, which could be sped up via low-rank updates to a QR or SVD decomposition. We suggest a more direct route, leading to algorithm 3. To derive this algorithm, we first introduce alg. 2 as a stepping stone to simplify the proofs; readers only interested in the implementation can skip ahead to alg. 3.
2. Improved sampling algorithm
2.1. Primal representation
Proof.
Let us denote by 𝒮 the output of alg. 2. Let us also denote by 𝒮_t (resp. p_t) the sample set (resp. the value of p) at the end of the t-th iteration of the loop of alg. 2. We have: 𝒮 = 𝒮_k. Using the Schur complement, we have, for all i ∉ 𝒮_t:

(2) det(K_{𝒮_t ∪ {i}}) = det(K_{𝒮_t}) (K_ii − K_{i,𝒮_t} (K_{𝒮_t})^{−1} K_{𝒮_t,i}).

Given Eq. (2), and as K is positive semidefinite by construction (such that det(K_𝒮) ≥ 0 for all 𝒮), one can show that p_t(i) ≥ 0 for all i and Σ_j p_t(j) > 0 for all t < k: at each iteration t, the probability p_t(i)/Σ_j p_t(j) is thus well defined. The loop being repeated k times, the number of samples of the output is thus necessarily equal to k. Thus, P(𝒮 = S) = 0 for all S of size different than k.
Let us now show that P(𝒮 = S) is indeed of determinantal form when |S| = k. By construction of alg. 2:

(3) P(𝒮 = S) = Σ_{(s_1,…,s_k)} Π_{t=1}^{k} p_{t−1}(s_t) / (Σ_j p_{t−1}(j)),

where the sum runs over all k! orderings (s_1, …, s_k) of S. Writing Eq. (2) for t = 1, …, k and iterating, one obtains: Π_{t=1}^{k} p_{t−1}(s_t) = det(K_S). Let us finish by showing that the denominator of Eq. (3) does not depend on the chosen samples. This is where the projective assumption on K is essential. One has:

Σ_j p_t(j) = Σ_j K_jj − Σ_j K_{j,𝒮_t} (K_{𝒮_t})^{−1} K_{𝒮_t,j}.
We have Σ_j K_jj = tr(K) = k. Moreover, by invariance of the trace to circular permutations, we have:

Σ_j K_{j,𝒮_t} (K_{𝒮_t})^{−1} K_{𝒮_t,j} = tr((K_{𝒮_t})^{−1} K_{𝒮_t,:} K_{:,𝒮_t}) = tr((K_{𝒮_t})^{−1} (K²)_{𝒮_t}).
Thus, K being a projection matrix (K² = K):

(4) Σ_j p_t(j) = k − tr((K_{𝒮_t})^{−1} K_{𝒮_t}) = k − t,

and

(5) P(𝒮 = S) = k! × det(K_S) / (k (k − 1) ⋯ 1) = det(K_S),
which ends the proof. ∎
Note that both Algorithms 1 and 2 have a computation cost of order O(N k³), such that the overall sampling algorithm given the eigendecomposition of L is on average of the order of O(N E(k)³). This is suboptimal. In fact, the scalar K_{i,𝒮_t} (K_{𝒮_t})^{−1} K_{𝒮_t,i} is computed from scratch at each iteration of the loop even though one could use Woodbury's identity to take advantage of computations done at past iterations. This observation leads to alg. 3. We have the following equivalence:
Proof.
Let us denote by 𝒮_t (resp. p_t) the sample set (resp. the value of p) at the end of the t-th iteration of the loop. Let us also denote by p_0 the initial value of p. All we need to show is that the p_t are equal in both algorithms. In alg. 3: p_t(i) = p_{t−1}(i) − f_t(i)² (where the vectors f_1, …, f_k are defined in the algorithm). Comparing with the p_t of alg. 2, all we need to show is:

(6) ∀i, Σ_{l=1}^{t} f_l(i)² = K_{i,𝒮_t} (K_{𝒮_t})^{−1} K_{𝒮_t,i}.
We will show more generally that:

(7) ∀ i, j, Σ_{l=1}^{t} f_l(i) f_l(j) = K_{i,𝒮_t} (K_{𝒮_t})^{−1} K_{𝒮_t,j}.

To do so, we propose a recurrence. Before we start, let us note that, by definition:

(8) ∀t, f_t = (K_{:,s_t} − Σ_{l=1}^{t−1} f_l(s_t) f_l) / √(p_{t−1}(s_t)).
Initialisation. It is true for t = 1, where 𝒮_1 is reduced to {s_1} and:

(9) K_{𝒮_1} = K_{s_1 s_1}

is a scalar. Indeed, the following holds for all i, j:

(10) f_1(i) f_1(j) = K_{i s_1} K_{s_1 j} / K_{s_1 s_1} = K_{i,𝒮_1} (K_{𝒮_1})^{−1} K_{𝒮_1,j}.
Hypothesis. We assume that Eq. (7) is true at iteration t.
Recurrence. Let us show it is also true at iteration t + 1. Using Woodbury's identity on K_{𝒮_{t+1}} = K_{𝒮_t ∪ {s_{t+1}}}, we show that:

(11) K_{i,𝒮_{t+1}} (K_{𝒮_{t+1}})^{−1} K_{𝒮_{t+1},j} = K_{i,𝒮_t} (K_{𝒮_t})^{−1} K_{𝒮_t,j} + α(i) α(j) / p_t(s_{t+1}),

where α(i) = K_{i s_{t+1}} − K_{i,𝒮_t} (K_{𝒮_t})^{−1} K_{𝒮_t,s_{t+1}}. Applying the hypothesis to the first term of the right-hand side of Eq. (11), the proof boils down to showing that:

(12) f_{t+1}(i) f_{t+1}(j) = α(i) α(j) / p_t(s_{t+1}).
By construction of Algorithm 3, f_{t+1} reads:

(13) f_{t+1} = (K_{:,s_{t+1}} − Σ_{l=1}^{t} f_l(s_{t+1}) f_l) / √(p_t(s_{t+1})).

Applying once more the hypothesis on Σ_{l=1}^{t} f_l(s_{t+1}) f_l(i), one obtains:

(14) f_{t+1}(i) = (K_{i s_{t+1}} − K_{i,𝒮_t} (K_{𝒮_t})^{−1} K_{𝒮_t,s_{t+1}}) / √(p_t(s_{t+1})) = α(i) / √(p_t(s_{t+1})),
which shows Eq. (12) and ends the proof. ∎
Alg. 3 samples a DPP with a computation cost of order O(N k²) instead of O(N k³), thereby gaining an order of magnitude. Moreover, it is straightforward to implement.
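To make this concrete, here is a numpy sketch of a sequential projection-DPP sampler with rank-1 updates of the conditionals, in the spirit of alg. 3 (naming and interface are ours, not the paper's reference implementation):

```python
import numpy as np

def sample_projection_dpp(V, rng):
    """Sequential sampler for a projection DPP with kernel K = V V^T,
    where V has k orthonormal columns. The chain rule plus rank-1
    updates of the conditional vector p keeps the cost at O(N k^2)."""
    N, k = V.shape
    p = np.sum(V**2, axis=1)              # p(i) = K_ii
    F = np.zeros((N, k))                  # columns will hold f_1, ..., f_k
    S = []
    for t in range(k):
        s = rng.choice(N, p=p / p.sum())  # draw the next item
        S.append(s)
        # f_t = (K_{:,s} - sum_{l<t} f_l(s) f_l) / sqrt(p(s))
        f = V @ V[s] - F[:, :t] @ F[s, :t]
        f /= np.sqrt(f[s])
        F[:, t] = f
        p = np.clip(p - f**2, 0.0, None)  # rank-1 downdate; clip tiny negatives
        p[S] = 0.0                        # never resample an item
    return sorted(S)

# Usage: sample 3 diverse items among 10
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.standard_normal((10, 3)))
S = sample_projection_dpp(V, rng)
```

The two `clip`/zeroing lines anticipate the numerical precautions discussed in the implementation details.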
2.2. Dual (low-rank) representation
The gain obtained in the previous section is not significant in the general case, as the limiting step of the overall sampling algorithm is in any case the diagonalisation of L, which costs O(N³). Thankfully, in many applications, a dual representation of L exists, i.e., a representation of L in a low-rank form

(15) L = Bᵀ B,

where B ∈ ℝ^{d×N}, with d a dimension that can be significantly smaller than N. In this case, we will see that preferring alg. 3 to alg. 1 does induce a significant gain in the overall computation time. The dual representation enables us to circumvent the diagonalisation cost of L and only diagonalise the dual form:

(16) C = B Bᵀ ∈ ℝ^{d×d},

costing only O(N d² + d³) (time to compute C from B and to compute the low-dimensional diagonalisation). C's eigendecomposition yields:
(17) C = W Δ Wᵀ,

with W = [w_1 | … | w_d] the basis of eigenvectors and Δ = diag(δ_1, …, δ_d) the diagonal matrix of eigenvalues such that δ_1 ≥ … ≥ δ_d ≥ 0. One can show (e.g., see Proposition 3.1 in [kulesza_determinantal_2012]) that all eigenvectors associated to nonzero eigenvalues of L can be recovered from C's eigendecomposition. More precisely, if w is an eigenvector of C associated to eigenvalue δ ≠ 0, then:

(18) u = (1/√δ) Bᵀ w

i/ Sample eigenvectors. Draw d Bernoulli variables with parameters δ_n / (1 + δ_n): for n = 1, …, d, add n to the set 𝒥 of sampled indices with probability δ_n / (1 + δ_n). Denote by k the number of elements of 𝒥; k is necessarily smaller than d.

ii/ Reconstruct the k selected eigenvectors of L via Eq. (18), and run alg. 1 (or alg. 3) on the obtained V ∈ ℝ^{N×k}.

The reconstruction operation costs O(N d k). Thus, preferring alg. 3 to alg. 1 in step ii/ lowers the total computation time of sampling a DPP from O(N d k + N k³) on average to O(N d k) (as d is necessarily larger than k).
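A numpy sketch of the dual eigenvector step, under the low-rank assumption L = BᵀB (names are ours; the lift back to dimension N follows Eq. (18)):

```python
import numpy as np

def dual_eigvec_step(B, rng):
    """Dual version of the eigenvector-sampling step: diagonalise
    C = B B^T (d x d) rather than L = B^T B (N x N), then lift each
    selected eigenvector w of C back to dimension N as B^T w / sqrt(delta)."""
    d, N = B.shape
    delta, W = np.linalg.eigh(B @ B.T)    # C = W diag(delta) W^T
    delta = np.clip(delta, 0.0, None)     # guard against tiny negative eigenvalues
    keep = rng.random(d) < delta / (1.0 + delta)
    return B.T @ W[:, keep] / np.sqrt(delta[keep])

# Usage with d = 5 << N = 50
rng = np.random.default_rng(0)
B = rng.standard_normal((5, 50))
V = dual_eigvec_step(B, rng)              # orthonormal columns in dimension N
```

Eigenvectors with δ = 0 are kept with probability 0, so the division by √δ is only ever applied to nonzero eigenvalues.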
A last algorithm in case of memory issues. If, for memory optimisation reasons, one does not desire to reconstruct the eigenvectors in dimension N, one may slightly alter alg. 3 by noticing that p, in the first line of the algorithm, reads:

(19) ∀i, p(i) = K_ii = B_{:,i}ᵀ (W_𝒥 Δ_𝒥^{−1} W_𝒥ᵀ) B_{:,i}.

One may thus first precompute the d×d matrix R = W_𝒥 Δ_𝒥^{−1} W_𝒥ᵀ, then directly work with B instead of V, and replace 1) K_{:,s_t} by Bᵀ R B_{:,s_t} and 2) all scalar products of the type f_l(i) by scalar products in dimension d. This leads to alg. 4. This algorithm is nevertheless slightly heavier computationally than alg. 3: it indeed runs in O(N d k) instead of O(N k²).
Proof.
We show a pointwise equivalence. Given Eq. (19), the initial value of p is the same in both algorithms:

(20) ∀i, p(i) = B_{:,i}ᵀ R B_{:,i} = K_ii.

More generally, the following holds for all i, j:

(21) B_{:,i}ᵀ R B_{:,j} = K_ij.
To differentiate notations, let us write f̃_l for the vectors defined in alg. 4, and keep the notation f_l for the ones defined in alg. 3. We then need to show that the values of p are the same in both algorithms, that is:

(22) ∀i, ∀l, f_l(i) = B_{:,i}ᵀ f̃_l.

We prove this with a recurrence. Initialization. It is true for l = 1:

(23) B_{:,i}ᵀ f̃_1 = B_{:,i}ᵀ R B_{:,s_1} / √(p_0(s_1)) = K_{i s_1} / √(K_{s_1 s_1}) = f_1(i).
Hypothesis. We assume it true for all integers strictly inferior to l. Recurrence. Let us show it is true for l. We have:

(24) f̃_l = (R B_{:,s_l} − Σ_{q=1}^{l−1} (B_{:,s_l}ᵀ f̃_q) f̃_q) / √(p_{l−1}(s_l)),

such that, by hypothesis, and using Eq. (21):

(25) B_{:,i}ᵀ f̃_l = (K_{i s_l} − Σ_{q=1}^{l−1} f_q(s_l) f_q(i)) / √(p_{l−1}(s_l)) = f_l(i),
thus ending the proof. ∎
2.3. Implementation details
Numerical difficulties can creep in when N is large, regardless of which algorithm is used. From a quick glance at alg. 4, we see that p decreases at every step, and when implemented in finite precision it may go negative. We suggest manually setting p to zero at indices that have already been sampled, and setting small negative values to 0. The difficulties are greatest when sampling DPPs that are highly repulsive, in which case one may not be able to sample more than a few points (unless one is willing to implement the algorithm in multiple-precision arithmetic, which is rather slow to run).
Depending on the programming language and the linear algebra backend, it may be much more efficient to implement the following step:

(26) f ← K_{:,s_t} − Σ_{l=1}^{t−1} f_l(s_t) f_l

as two matrix multiplications rather than a loop over l. Simply stack the vectors f_1, …, f_{t−1} in a matrix F = [f_1 | … | f_{t−1}], and compute the equivalent form:

(27) f ← V (V_{s_t,:})ᵀ − F (F_{s_t,:})ᵀ.
One benefit of the above formulation is that it can take advantage of a parallel BLAS (and indeed, this is the only step in alg. 3 where parallelisation could be used in a meaningful way).
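As a small illustration (toy sizes, names ours), the loop and the stacked form compute the same vector:

```python
import numpy as np

rng = np.random.default_rng(1)
N, k, t, s = 40, 5, 3, 7                 # toy sizes; s is the newly sampled index
V, _ = np.linalg.qr(rng.standard_normal((N, k)))
F = rng.standard_normal((N, t))          # stand-ins for f_1, ..., f_t

# Loop version: accumulate the corrections one vector at a time.
f_loop = V @ V[s]
for l in range(t):
    f_loop = f_loop - F[:, l] * F[s, l]

# Stacked version: two matrix-vector products, BLAS-friendly.
f_mat = V @ V[s] - F @ F[s]

assert np.allclose(f_loop, f_mat)
```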
3. Conclusion
We have described two algorithms for exact sampling of discrete DPPs. For low-rank L-ensembles, we have found in practice that well-implemented exact algorithms are often competitive with approximate samplers like the Gibbs sampler [li_fast_2016]. The great challenge for both types of algorithms lies in scaling in k: it is easy enough to sample from very large datasets (N in the millions), but because of the quadratic scaling in k one is limited to taking small samples (k in the hundreds). Because increasing k also means increasing the rank of the L-matrix, one ends up paying a double penalty: when forming the L-ensemble, and during sampling.

One solution would be to formulate a sparse L-ensemble, in which case alg. 3 carries over in sparse matrix algebra. Another, perhaps simpler, approach is to subdivide the dataset ahead of time and use DPPs in each subset. We plan to investigate these possibilities in future work.