 # Optimized Algorithms to Sample Determinantal Point Processes

In this technical report, we discuss several sampling algorithms for Determinantal Point Processes (DPPs). DPPs have recently gained broad interest in the machine learning and statistics literature as random point processes with negative correlation, i.e., ones that can generate a "diverse" sample from a set of items. They are parametrized by a matrix L, called the L-ensemble, that encodes the correlations between items. The standard sampling algorithm divides into three phases: 1/ an eigendecomposition of L, 2/ an eigenvector sampling phase where the eigenvectors of L are selected independently via Bernoulli variables parametrized by their associated eigenvalues, 3/ a Gram-Schmidt-type orthogonalisation procedure on the sampled eigenvectors. In a naive implementation, the computational cost of the third step is on average O(Nμ³), where μ is the average number of samples of the DPP. We give an algorithm which runs in O(Nμ²) and is extremely simple to implement. If memory is a constraint, we also describe a dual variant with reduced memory costs. In addition, we discuss implementation details often missing in the literature.


## 1. Introduction

Determinantal Point Processes enable a form of subsampling that generalises sampling without replacement: from a set of items $\mathcal{X}$, we draw a random subset $\mathcal{S} \subseteq \mathcal{X}$ such that $\mathcal{S}$ preserves some of the diversity in $\mathcal{X}$. Here we focus solely on discrete DPPs, where $\mathcal{X}$ has $N$ elements. A generic algorithm for exact sampling from discrete DPPs was given in [hough_determinantal_2006], and popularised by [kulesza_determinantal_2012]. Implemented naively, this algorithm has cost $O(N\mu^3)$, where $\mu$ is the expected number of samples. The point of this note is to derive and describe a simpler algorithm with cost $O(N\mu^2)$. We make no great claim to novelty, as other algorithms with the same asymptotic cost exist, but the one we give is very easily stated and trivial to implement.

### 1.1. Notations

Sets are in upper-case calligraphic letters: $\mathcal{S}, \mathcal{X}$. Vectors are in lower-case bold letters: $\mathbf{u}$. The $i$-th entry of vector $\mathbf{u}$ is written either $u_i$ or $u(i)$. For instance, $\boldsymbol{\delta}_i$ is the vector such that $\delta_i(i) = 1$ and $\delta_i(j) = 0$ for all $j \neq i$. Matrices are in upper-case letters: $L$. For instance, $I_N$ is the identity matrix in dimension $N$. If the dimension is not specified, it can be guessed from the context. The $(i,j)$-th element of matrix $L$ is written $L_{ij}$ and its $i$-th column is written $\mathbf{l}_i$. $L_{\mathcal{S},\mathcal{T}}$ is the restriction of matrix $L$ to the rows (resp. columns) indexed by the elements of $\mathcal{S}$ (resp. $\mathcal{T}$). We write $L_{\mathcal{S}}$ as a shorthand for $L_{\mathcal{S},\mathcal{S}}$. Moreover, $[n]$ stands for the set of the first $n$ integers $\{1, \ldots, n\}$. Finally, the notation $\det(L)$ stands for the determinant of matrix $L$.

### 1.2. Definitions

We give here a definition of DPPs via $L$-ensembles. Equivalent formulations are based on the marginal kernel and specify marginal probabilities (see e.g. [kulesza_determinantal_2012] for details).

###### Definition 1.1 (Determinantal Point Process).

Consider a point process, i.e., a process that randomly draws an element $\mathcal{S} \subseteq \mathcal{X}$. It is determinantal if there exists a positive semi-definite (PSD) matrix $L \in \mathbb{R}^{N \times N}$ such that the probability of sampling $\mathcal{S}$ is:

$$P(\mathcal{S}) = \frac{\det(L_{\mathcal{S}})}{\det(I + L)}.$$

$L$ is called the $L$-ensemble of the DPP.
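As a quick numerical sanity check of this definition, one can verify on a small example that these probabilities sum to one over all $2^N$ subsets (a minimal NumPy sketch; the random PSD matrix is illustrative):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
L = A @ A.T                      # a random PSD L-ensemble on N = 4 items

# P(S) = det(L_S) / det(I + L); summing det(L_S) over all subsets
# (including the empty set, whose determinant is 1 by convention)
# recovers the normalisation constant det(I + L).
total = sum(
    np.linalg.det(L[np.ix_(S, S)]) if S else 1.0
    for r in range(5)
    for S in combinations(range(4), r)
)
assert np.isclose(total, np.linalg.det(np.eye(4) + L))
```

The empty set contributes $\det(L_{\emptyset}) = 1$ by convention, which is why its term in the sum is $1$.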

$L$ being PSD, it is diagonalisable as:

$$L = U \Lambda U^\top, \tag{1}$$

with $U = (\mathbf{u}_1 | \ldots | \mathbf{u}_N)$ its set of orthonormal eigenvectors and $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_N)$ the diagonal matrix of sorted eigenvalues: $0 \leq \lambda_1 \leq \ldots \leq \lambda_N$.

The number of samples $|\mathcal{S}|$ of a DPP is random and is distributed according to a sum of Bernoulli variables of parameters $\frac{\lambda_n}{1+\lambda_n}$ [kulesza_determinantal_2012]. The expected number of samples is thus

$$\mu = \mathbb{E}(|\mathcal{S}|) = \sum_n \frac{\lambda_n}{1+\lambda_n},$$

and the variance of the number of samples is

$$\operatorname{Var}(|\mathcal{S}|) = \sum_n \frac{\lambda_n}{(1+\lambda_n)^2}.$$
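In code, these two moments follow directly from the Bernoulli parameters $\lambda_n/(1+\lambda_n)$ (a minimal sketch; `sample_size_moments` is an illustrative name):

```python
import numpy as np

def sample_size_moments(eigvals):
    """Mean and variance of |S| for a DPP with L-eigenvalues `eigvals`."""
    lam = np.asarray(eigvals, dtype=float)
    b = lam / (1.0 + lam)                  # Bernoulli parameters
    return b.sum(), (b * (1.0 - b)).sum()  # mu = sum b_n, Var = sum b_n (1 - b_n)
```

For instance, with all eigenvalues equal to 1, each eigenvector is kept with probability $1/2$, so $\mu = N/2$ and $\operatorname{Var}(|\mathcal{S}|) = N/4$.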

In many cases, it is preferable to constrain the DPP to output a fixed number of samples $k$. This leads to $k$-DPPs:

###### Definition 1.2 (k-DPP [kulesza_determinantal_2012]).

Consider a point process that randomly draws an element $\mathcal{S} \subseteq \mathcal{X}$. This process is a $k$-DPP with $L$-ensemble $L$ if:

1. $P(\mathcal{S}) = \frac{1}{Z} \det(L_{\mathcal{S}})\, \mathbb{1}(|\mathcal{S}| = k)$, with $Z$ the constant s.t. $\sum_{\mathcal{S}} P(\mathcal{S}) = 1$.

### 1.3. The standard DPP sampling algorithm

The standard algorithm to sample a DPP from an $L$-ensemble $L$ given its eigendecomposition may be decomposed in two steps [hough_determinantal_2006]:

1. Sample eigenvectors. Draw $N$ Bernoulli variables with parameters $\frac{\lambda_n}{1+\lambda_n}$: for $n = 1, \ldots, N$, add $n$ to the set of sampled indices $\mathcal{V}$ with probability $\frac{\lambda_n}{1+\lambda_n}$. We generically denote by $k = |\mathcal{V}|$ the number of elements in $\mathcal{V}$. Note that the expected value of $k$ is $\mu$.

2. Run alg. 1 to sample a $k$-DPP with projective $L$-ensemble $P = VV^\top$, where $V \in \mathbb{R}^{N \times k}$ concatenates all eigenvectors $\mathbf{u}_n$ such that $n \in \mathcal{V}$. Note that $V^\top V = I_k$.

For the proof that the combination of these two steps samples a DPP with $L$-ensemble $L$, we refer the reader to the papers [hough_determinantal_2006, kulesza_determinantal_2012].

The expensive step in alg. 1 is the orthogonalisation, which could be sped up via low-rank updates to a QR or SVD decomposition. We suggest a more direct route, leading to algorithm 3. To derive this algorithm, we first introduce alg. 2 as a stepping stone to simplify the proofs, but readers only interested in the implementation can skip ahead to alg. 3.
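For concreteness, the two-step procedure above can be sketched in NumPy as follows (a hedged sketch rather than a verbatim transcription of alg. 1: the function names are ours, and we re-orthonormalise with a QR decomposition, which is one possible implementation of the Gram-Schmidt step):

```python
import numpy as np

def sample_projection_dpp(V, rng):
    """Sample from the projection DPP with kernel P = V V^T (alg. 1 style).

    Each step costs O(N k^2) because of the QR re-orthonormalisation,
    hence O(N k^3) overall.
    """
    V = V.copy()
    S = []
    while V.shape[1] > 0:
        p = np.clip(np.sum(V**2, axis=1), 0.0, None)  # p(i) = current P_ii
        i = rng.choice(V.shape[0], p=p / p.sum())
        S.append(i)
        # Project the span of V onto the orthocomplement of delta_i:
        j = np.argmax(np.abs(V[i, :]))
        V = V - np.outer(V[:, j], V[i, :] / V[i, j])  # zeroes row i (and column j)
        V = np.delete(V, j, axis=1)
        if V.shape[1] > 0:
            V, _ = np.linalg.qr(V)                    # re-orthonormalise

    return S

def sample_dpp(L, rng):
    """Standard DPP sampler: spectral phase, then projection-DPP phase."""
    lam, U = np.linalg.eigh(L)                        # step 1: eigendecomposition
    keep = rng.random(len(lam)) < lam / (1.0 + lam)   # step 2: Bernoulli draws
    return sample_projection_dpp(U[:, keep], rng)     # step 3: alg. 1
```

After a point $i$ is drawn, every remaining column has a zero $i$-th coordinate, so $p(i) = 0$ at all later steps and the sample contains no repeats.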

## 2. Improved sampling algorithm

### 2.1. Primal representation

An alternative to alg. 1 to compute step ii/ above is given in alg. 2. In fact:

###### Lemma 2.1.

Like alg. 1, alg. 2 samples a $k$-DPP with projective $L$-ensemble $P = VV^\top$.

###### Proof.

Let us denote by $\mathcal{S}$ the output of alg. 2. Let us also denote by $\mathcal{S}_n$ (resp. $p_n$) the sample set (resp. the value of $p$) at the end of the $n$-th iteration of the loop of alg. 2. We have $\mathcal{S} = \mathcal{S}_k$ and, by convention, $\mathcal{S}_0 = \emptyset$. Using the Schur complement, we have:

$$\forall n \in [1,k],\ \forall i, \quad \det\left(P_{\mathcal{S}_{n-1} \cup \{i\}}\right) = \left(P_{ii} - P_{\mathcal{S}_{n-1},i}^\top P_{\mathcal{S}_{n-1}}^{-1} P_{\mathcal{S}_{n-1},i}\right) \det\left(P_{\mathcal{S}_{n-1}}\right) = p_{n-1}(i) \det\left(P_{\mathcal{S}_{n-1}}\right). \tag{2}$$

Given Eq. (2), and as $P$ is PSD by construction (such that all its principal minors are non-negative), one can show that $\forall i,\ p_{n-1}(i) \geq 0$ and $\sum_i p_{n-1}(i) > 0$: at each iteration $n$, the probability distribution $p_{n-1}(\cdot) / \sum_i p_{n-1}(i)$ is thus well defined. The loop being repeated $k$ times, the number of samples of the output is thus necessarily equal to $k$. Thus, $P(\mathcal{S}) = 0$ for all $\mathcal{S}$ of size different than $k$.

Let us now show that $P(\mathcal{S})$ is indeed of determinantal form when $|\mathcal{S}| = k$. By construction of alg. 2:

$$P(\mathcal{S}) = \prod_{n=1}^{k} P(s_n \mid s_1, s_2, \ldots, s_{n-1}) = \prod_{n=1}^{k} \frac{p_{n-1}(s_n)}{\sum_{i=1}^{N} p_{n-1}(i)}. \tag{3}$$

Writing Eq. (2) for $i = s_n$ and iterating, one obtains $\prod_{n=1}^{k} p_{n-1}(s_n) = \det(P_{\mathcal{S}})$. Let us finish by showing that the denominator of Eq. (3) does not depend on the chosen samples. This is where the projective assumption on $P$ is essential. One has:

$$\forall n \in [1,k], \quad \sum_{i=1}^{N} p_{n-1}(i) = \sum_{i=1}^{N} p_0(i) - \sum_{i=1}^{N} P_{\mathcal{S}_{n-1},i}^\top P_{\mathcal{S}_{n-1}}^{-1} P_{\mathcal{S}_{n-1},i}.$$

We have $\sum_{i=1}^{N} p_0(i) = \sum_{i=1}^{N} P_{ii} = \operatorname{Tr}(VV^\top) = \operatorname{Tr}(V^\top V) = k$. Moreover, by invariance of the trace to circular permutations, we have:

$$\begin{aligned}
\sum_{i=1}^{N} P_{\mathcal{S}_{n-1},i}^\top P_{\mathcal{S}_{n-1}}^{-1} P_{\mathcal{S}_{n-1},i}
&= \operatorname{Tr}\left(V V_{\mathcal{S}_{n-1},[k]}^\top P_{\mathcal{S}_{n-1}}^{-1} V_{\mathcal{S}_{n-1},[k]} V^\top\right) \\
&= \operatorname{Tr}\left(P_{\mathcal{S}_{n-1}}^{-1} V_{\mathcal{S}_{n-1},[k]} V^\top V V_{\mathcal{S}_{n-1},[k]}^\top\right) \\
&= \operatorname{Tr}\left(P_{\mathcal{S}_{n-1}}^{-1} V_{\mathcal{S}_{n-1},[k]} V_{\mathcal{S}_{n-1},[k]}^\top\right) \\
&= \operatorname{Tr}\left(P_{\mathcal{S}_{n-1}}^{-1} P_{\mathcal{S}_{n-1}}\right) = \operatorname{Tr}(I_{n-1}) = n - 1.
\end{aligned}$$

Thus $\sum_{i=1}^{N} p_{n-1}(i) = k - n + 1$, which does not depend on the drawn samples, and finally:

$$\forall \mathcal{S} \text{ s.t. } |\mathcal{S}| \neq k, \quad P(\mathcal{S}) = 0, \tag{4}$$

$$\text{and} \quad \forall \mathcal{S} \text{ s.t. } |\mathcal{S}| = k, \quad P(\mathcal{S}) = \frac{1}{Z} \det(P_{\mathcal{S}}) \quad \text{with } Z = \prod_{n=1}^{k} (k - n + 1) = k!, \tag{5}$$

which ends the proof. ∎

Note that both Algorithms 1 and 2 have a computation cost of order $O(Nk^3)$, such that the overall sampling algorithm, given the eigendecomposition of $L$, is in average of the order $O(N\mu^3)$. This is suboptimal. In fact, the scalar $P_{\mathcal{S}_{n-1},i}^\top P_{\mathcal{S}_{n-1}}^{-1} P_{\mathcal{S}_{n-1},i}$ is computed from scratch at each iteration of the loop even though one could use Woodbury's identity to take advantage of computations done at past iterations. This observation leads to alg. 3. We have the following equivalence:

###### Lemma 2.2.

Alg. 3 is equivalent to alg. 2: it also samples a $k$-DPP with projective $L$-ensemble $P = VV^\top$.

###### Proof.

Let us denote by $\mathcal{S}_n$ (resp. $p_n$) the sample set (resp. the value of $p$) at the end of the $n$-th iteration of the loop. Let us also denote by $p_0$ the initial value of $p$. All we need to show is that the $p_n$ are equal in both algorithms. In alg. 3: $p_n(i) = \|\mathbf{y}_i\|^2 - \sum_{l=1}^{n} (\mathbf{f}_l^\top \mathbf{y}_i)^2$ (where the vectors $\mathbf{f}_l$ and $\mathbf{y}_i$ are defined in the algorithm). Comparing with the $p_n$ of alg. 2, all we need to show is:

$$\forall n,\ \forall i, \quad \|\mathbf{y}_i\|^2 - \sum_{l=1}^{n} (\mathbf{f}_l^\top \mathbf{y}_i)^2 = P_{ii} - P_{\mathcal{S}_n,i}^\top P_{\mathcal{S}_n}^{-1} P_{\mathcal{S}_n,i}, \tag{6}$$

which, since $\|\mathbf{y}_i\|^2 = P_{ii}$ (see Eq. (8) below), follows from the stronger statement:

$$\forall n,\ \forall (i,j), \quad \sum_{l=1}^{n} (\mathbf{f}_l^\top \mathbf{y}_i)(\mathbf{f}_l^\top \mathbf{y}_j) = P_{\mathcal{S}_n,i}^\top P_{\mathcal{S}_n}^{-1} P_{\mathcal{S}_n,j}. \tag{7}$$

To do so, we propose a recurrence. Before we start, let us note that, by definition:

$$\forall (i,j), \quad \mathbf{y}_i^\top \mathbf{y}_j = \boldsymbol{\delta}_i^\top V V^\top \boldsymbol{\delta}_j = \boldsymbol{\delta}_i^\top P \boldsymbol{\delta}_j = P_{ij}. \tag{8}$$

Initialisation. It is true for $n = 1$, where $\mathcal{S}_1$ is reduced to $\{s_1\}$ and:

$$P_{\mathcal{S}_1}^{-1} = \frac{1}{P_{s_1 s_1}} \tag{9}$$

is a scalar. Indeed, as $\mathbf{f}_1 = \mathbf{y}_{s_1} / \|\mathbf{y}_{s_1}\|$, the following holds for all $(i,j)$:

$$(\mathbf{f}_1^\top \mathbf{y}_i)(\mathbf{f}_1^\top \mathbf{y}_j) = \frac{(\mathbf{y}_{s_1}^\top \mathbf{y}_i)(\mathbf{y}_{s_1}^\top \mathbf{y}_j)}{\|\mathbf{y}_{s_1}\|^2} = \frac{P_{s_1 i}\, P_{s_1 j}}{P_{s_1 s_1}} = P_{\mathcal{S}_1,i}^\top P_{\mathcal{S}_1}^{-1} P_{\mathcal{S}_1,j}. \tag{10}$$

Hypothesis. We assume that Eq. (7) is true at iteration $n-1$.
Recurrence. Let us show it is also true at iteration $n$. Using Woodbury's identity on $P_{\mathcal{S}_n}^{-1}$, we show that:

$$P_{\mathcal{S}_n,i}^\top P_{\mathcal{S}_n}^{-1} P_{\mathcal{S}_n,j} = P_{\mathcal{S}_{n-1},i}^\top P_{\mathcal{S}_{n-1}}^{-1} P_{\mathcal{S}_{n-1},j} + \frac{z_n(i)\, z_n(j)}{z_n(s_n)}, \tag{11}$$

where $z_n(i) = P_{s_n i} - P_{\mathcal{S}_{n-1},i}^\top P_{\mathcal{S}_{n-1}}^{-1} P_{\mathcal{S}_{n-1},s_n}$. Applying the hypothesis to $P_{\mathcal{S}_{n-1},i}^\top P_{\mathcal{S}_{n-1}}^{-1} P_{\mathcal{S}_{n-1},j}$ in Eq. (11), the proof boils down to showing that:

$$\forall (i,j), \quad (\mathbf{f}_n^\top \mathbf{y}_i)(\mathbf{f}_n^\top \mathbf{y}_j) = \frac{z_n(i)\, z_n(j)}{z_n(s_n)}. \tag{12}$$

By construction of Algorithm 3, $\mathbf{f}_n$ reads:

$$\mathbf{f}_n = \frac{\mathbf{y}_{s_n} - \sum_{l=1}^{n-1} (\mathbf{f}_l^\top \mathbf{y}_{s_n})\, \mathbf{f}_l}{\left\| \mathbf{y}_{s_n} - \sum_{l=1}^{n-1} (\mathbf{f}_l^\top \mathbf{y}_{s_n})\, \mathbf{f}_l \right\|}. \tag{13}$$

Applying once more the hypothesis on the numerator and the denominator, one obtains:

$$\forall i, \quad \mathbf{f}_n^\top \mathbf{y}_i = \frac{z_n(i)}{\sqrt{z_n(s_n)}}, \tag{14}$$

which shows Eq. (12) and ends the proof. ∎

Alg. 3 samples a $k$-DPP with a computation cost of order $O(Nk^2)$ instead of $O(Nk^3)$, thereby gaining an order of magnitude. Moreover, it is straightforward to implement.
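This accelerated scheme can be sketched in NumPy as follows (our own transcription of alg. 3, with the finite-precision guard discussed in section 2.3 added; the function name is illustrative):

```python
import numpy as np

def sample_k_dpp_fast(V, rng):
    """O(N k^2) sampler for the projection DPP with kernel P = V V^T (alg. 3 style).

    V: (N, k) matrix with orthonormal columns. Row i of V plays the role of
    y_i = V^T delta_i in the derivation above.
    """
    N, k = V.shape
    p = np.sum(V**2, axis=1)            # p_0(i) = ||y_i||^2 = P_ii
    F = np.zeros((k, k))                # row l holds f_l (in y-coordinates)
    S = []
    for n in range(k):
        p = np.clip(p, 0.0, None)       # finite-precision guard
        p[S] = 0.0                      # already-sampled indices stay out
        s = rng.choice(N, p=p / p.sum())
        S.append(s)
        # f_n: component of y_{s_n} orthogonal to f_1 .. f_{n-1}, normalised
        f = V[s] - F[:n].T @ (F[:n] @ V[s])
        F[n] = f / np.linalg.norm(f)
        # Woodbury-style update: p_n(i) = p_{n-1}(i) - (f_n^T y_i)^2
        p = p - (V @ F[n]) ** 2
    return S
```

Each iteration costs $O(Nk)$ (one matrix-vector product against $V$), for $O(Nk^2)$ in total.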

### 2.2. Dual (low-rank) representation

The gain obtained in the previous section is not significant in the general case, as the limiting step of the overall sampling algorithm is in any case the diagonalisation of $L$, which costs $O(N^3)$. Thankfully, in many applications, a dual representation of $L$ exists, i.e., a representation of $L$ in a low-rank form

$$L = \Psi^\top \Psi, \tag{15}$$

where $\Psi \in \mathbb{R}^{d \times N}$, with a dimension $d$ that can be significantly smaller than $N$. In this case, we will see that preferring alg. 3 to alg. 1 does induce a significant gain in the overall computation time. The dual representation enables us to circumvent the diagonalisation cost of $L$ and only diagonalise the dual form:

$$C = \Psi \Psi^\top \in \mathbb{R}^{d \times d}, \tag{16}$$

costing only $O(Nd^2 + d^3)$ (the time to compute $C$ from $\Psi$ and to compute the low-dimensional diagonalisation). $C$'s eigendecomposition yields:

$$C = R E R^\top, \tag{17}$$

with $R = (\mathbf{r}_1 | \ldots | \mathbf{r}_d)$ the basis of eigenvectors and $E = \operatorname{diag}(e_1, \ldots, e_d)$ the diagonal matrix of eigenvalues. One can show (e.g., see Proposition 3.1 in [kulesza_determinantal_2012]) that all eigenvectors associated to non-zero eigenvalues of $L$ can be recovered from $C$'s eigendecomposition. More precisely, if $\mathbf{r}_k$ is an eigenvector of $C$ associated to eigenvalue $e_k \neq 0$, then:

$$\mathbf{u}_k = \frac{1}{\sqrt{e_k}} \Psi^\top \mathbf{r}_k \tag{18}$$

is a normalized eigenvector of $L$ with the same eigenvalue. Thus, given the eigendecomposition of the dual form, the standard algorithm boils down to:

1. Sample eigenvectors. Draw $d$ Bernoulli variables with parameters $\frac{e_n}{1+e_n}$: for $n = 1, \ldots, d$, add $n$ to the set of sampled indices $\mathcal{V}$ with probability $\frac{e_n}{1+e_n}$. Denote by $m$ the number of elements of $\mathcal{V}$; $m$ is necessarily smaller than $d$.

2. Denote by $W$ the matrix concatenating all eigenvectors $\mathbf{r}_n$ such that $n \in \mathcal{V}$, and $\tilde{E}$ the diagonal matrix of associated eigenvalues. Run alg. 1 or 3 to sample a $m$-DPP with projective $L$-ensemble $P = VV^\top$, where $V = \Psi^\top W \tilde{E}^{-1/2}$ concatenates the reconstructed eigenvectors in dimension $N$. Note that $V^\top V = I_m$.

The reconstruction operation costs $O(Ndm)$. Thus, preferring alg. 3 to alg. 1 in step ii/ lowers the total computation time of sampling a DPP from $O(Nd^2 + N\mu^3)$ in average to $O(Nd^2)$ (as $d$ is necessarily larger than $m$).
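A sketch of this reconstruction (Eq. (18)) in NumPy, with an illustrative function name and a small threshold to decide which eigenvalues count as non-zero:

```python
import numpy as np

def primal_eigvecs_from_dual(Psi, tol=1e-12):
    """Eigenpairs of L = Psi^T Psi recovered from the dual C = Psi Psi^T."""
    C = Psi @ Psi.T                 # d x d, costs O(N d^2)
    e, R = np.linalg.eigh(C)        # low-dimensional diagonalisation
    keep = e > tol * e.max()        # numerically non-zero eigenvalues
    e, R = e[keep], R[:, keep]
    U = (Psi.T @ R) / np.sqrt(e)    # u_k = Psi^T r_k / sqrt(e_k), Eq. (18)
    return e, U
```

The columns of the returned `U` are orthonormal eigenvectors of $L$ with the same non-zero eigenvalues as $C$.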

A last algorithm in case of memory issues. If, for memory optimization reasons, one does not desire to reconstruct the eigenvectors in dimension $N$, one may slightly alter alg. 3 by noticing that $\mathbf{y}_i$, in the first line of the algorithm, reads:

$$\mathbf{y}_i = V^\top \boldsymbol{\delta}_i = \tilde{E}^{-1/2} W^\top \Psi \boldsymbol{\delta}_i = \tilde{E}^{-1/2} W^\top \boldsymbol{\psi}_i. \tag{19}$$

One may thus first precompute $\tilde{C} = W \tilde{E}^{-1} W^\top$, then directly work with $\boldsymbol{\psi}_i$ instead of $\mathbf{y}_i$ and replace 1) $\|\mathbf{y}_i\|^2$ by $\boldsymbol{\psi}_i^\top \tilde{C} \boldsymbol{\psi}_i$ and 2) all scalar products of the type $\mathbf{y}_i^\top \mathbf{y}_j$ by $\boldsymbol{\psi}_i^\top \tilde{C} \boldsymbol{\psi}_j$. This leads to alg. 4. This algorithm is nevertheless slightly heavier computationally than alg. 3: it indeed runs in $O(Ndm)$ instead of $O(Nm^2)$.

###### Lemma 2.3.

Alg. 4 with inputs $\Psi$, $W$ and $\tilde{E}$ is equivalent to alg. 3 with input $V = \Psi^\top W \tilde{E}^{-1/2}$.

###### Proof.

We show a point-wise equivalence. Given Eq. (19), the initial value of $p(i)$ is the same in both algorithms:

$$\|\mathbf{y}_i\|^2 = \boldsymbol{\psi}_i^\top W \tilde{E}^{-1} W^\top \boldsymbol{\psi}_i = \boldsymbol{\psi}_i^\top \tilde{C} \boldsymbol{\psi}_i. \tag{20}$$

More generally, the following holds for all $(i,j)$:

$$\mathbf{y}_i^\top \mathbf{y}_j = \boldsymbol{\psi}_i^\top \tilde{C} \boldsymbol{\psi}_j. \tag{21}$$

To differentiate notations, let us write $\tilde{\mathbf{f}}_n$ for the vectors defined in alg. 4, and keep the notation $\mathbf{f}_n$ for the ones defined in alg. 3. We then need to show that the values of $p$ are the same in both algorithms, that is:

$$\forall i, n, \quad \tilde{\mathbf{f}}_n^\top \tilde{C} \boldsymbol{\psi}_i = \mathbf{f}_n^\top \mathbf{y}_i. \tag{22}$$

We prove this with a recurrence. Initialization. It is true for $n = 1$:

$$\tilde{\mathbf{f}}_1^\top \tilde{C} \boldsymbol{\psi}_i = \frac{\boldsymbol{\psi}_{s_1}^\top \tilde{C} \boldsymbol{\psi}_i}{\sqrt{\boldsymbol{\psi}_{s_1}^\top \tilde{C} \boldsymbol{\psi}_{s_1}}} = \frac{\mathbf{y}_{s_1}^\top \mathbf{y}_i}{\sqrt{\mathbf{y}_{s_1}^\top \mathbf{y}_{s_1}}} = \mathbf{f}_1^\top \mathbf{y}_i. \tag{23}$$

Hypothesis. We assume it true for all integers strictly inferior to $n$. Recurrence. Let us show it is true for $n$. We have:

$$\tilde{\mathbf{f}}_n^\top \tilde{C} \boldsymbol{\psi}_i = \frac{\boldsymbol{\psi}_{s_n}^\top \tilde{C} \boldsymbol{\psi}_i - \sum_{l=1}^{n-1} (\tilde{\mathbf{f}}_l^\top \tilde{C} \boldsymbol{\psi}_i)(\tilde{\mathbf{f}}_l^\top \tilde{C} \boldsymbol{\psi}_{s_n})}{\sqrt{\boldsymbol{\psi}_{s_n}^\top \tilde{C} \boldsymbol{\psi}_{s_n} - \sum_{l=1}^{n-1} (\tilde{\mathbf{f}}_l^\top \tilde{C} \boldsymbol{\psi}_{s_n})^2}}, \tag{24}$$

such that, by hypothesis, and using Eq. (21):

$$\tilde{\mathbf{f}}_n^\top \tilde{C} \boldsymbol{\psi}_i = \frac{\mathbf{y}_{s_n}^\top \mathbf{y}_i - \sum_{l=1}^{n-1} (\mathbf{f}_l^\top \mathbf{y}_i)(\mathbf{f}_l^\top \mathbf{y}_{s_n})}{\sqrt{\mathbf{y}_{s_n}^\top \mathbf{y}_{s_n} - \sum_{l=1}^{n-1} (\mathbf{f}_l^\top \mathbf{y}_{s_n})^2}} = \mathbf{f}_n^\top \mathbf{y}_i, \tag{25}$$

thus ending the proof. ∎
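The resulting procedure can be sketched as follows (our transcription of alg. 4 under the notations above; the clipping guard discussed in section 2.3 is included, and the function name is ours):

```python
import numpy as np

def sample_k_dpp_dual(Psi, W, E, rng):
    """O(N d m) dual sampler (alg. 4 style): never forms the N x m eigenvectors.

    Psi: (d, N) dual features; W: (d, m) sampled eigenvectors of C = Psi Psi^T;
    E: (m,) their (non-zero) eigenvalues.
    """
    d, N = Psi.shape
    m = W.shape[1]
    Ct = (W / E) @ W.T              # C~ = W E^{-1} W^T, d x d
    CPsi = Ct @ Psi                 # C~ psi_i for all i, computed once
    p = np.sum(Psi * CPsi, axis=0)  # p_0(i) = psi_i^T C~ psi_i
    Ft = np.zeros((m, d))           # row l holds f~_l
    S = []
    for n in range(m):
        p = np.clip(p, 0.0, None)   # finite-precision guard
        p[S] = 0.0
        s = rng.choice(N, p=p / p.sum())
        S.append(s)
        # f~_n, normalised in the C~ metric; the denominator is p_{n-1}(s_n)
        g = Ft[:n] @ CPsi[:, s]     # f~_l^T C~ psi_{s_n}, l < n
        Ft[n] = (Psi[:, s] - Ft[:n].T @ g) / np.sqrt(p[s])
        p = p - (Ft[n] @ CPsi) ** 2 # p_n(i) = p_{n-1}(i) - (f~_n^T C~ psi_i)^2
    return S
```

All inner products are taken in the $\tilde{C}$ metric, so only $d$-dimensional vectors are ever stored: the update step costs $O(Nd)$ per iteration instead of $O(Nm)$.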

### 2.3. Implementation details

Numerical difficulties can creep in when $k$ is large, regardless of which algorithm is used. From a quick glance at alg. 4, we see that $p$ decreases at every step, and when implemented in finite precision it may go negative. We suggest manually setting $p$ to zero at indices already sampled, and clipping small negative values to 0. The difficulties are greatest when sampling DPPs that are highly repulsive, in which case one may not be able to sample more than a few points (unless one is willing to implement the algorithm in multiple-precision arithmetic, which is rather slow to run).

Depending on the programming language and the linear algebra backend, it may be much more efficient to implement the following step:

$$\mathbf{f}_n = \mathbf{y}_{s_n} - \sum_{l=1}^{n-1} \mathbf{f}_l (\mathbf{f}_l^\top \mathbf{y}_{s_n}) \tag{26}$$

as two matrix multiplications rather than a loop over $l$. Simply stack $\mathbf{f}_1, \ldots, \mathbf{f}_{n-1}$ in a matrix $F_{n-1}$, and compute the equivalent form:

$$\mathbf{f}_n = \mathbf{y}_{s_n} - F_{n-1} F_{n-1}^\top \mathbf{y}_{s_n}. \tag{27}$$

One benefit of the above formulation is that it can take advantage of a parallel BLAS (and indeed, this is the only step in alg. 3 where parallelisation could be used in a meaningful way).
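The equivalence between the loop form (26) and the stacked form (27) is easy to check numerically (a small illustrative sketch; in NumPy the stacked form resolves to two BLAS calls):

```python
import numpy as np

rng = np.random.default_rng(1)
N, n = 1000, 20
F = rng.standard_normal((N, n - 1))  # columns f_1 .. f_{n-1}, stacked
y = rng.standard_normal(N)           # y_{s_n}

# Eq. (26): explicit loop over l (n-1 rank-1 operations)
f_loop = y - sum(F[:, l] * (F[:, l] @ y) for l in range(n - 1))

# Eq. (27): two matrix-vector products, f_n = y - F F^T y
f_mat = y - F @ (F.T @ y)

assert np.allclose(f_loop, f_mat)
```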

## 3. Conclusion

We have described two algorithms for exact sampling of discrete DPPs. For low-rank $L$-ensembles, we have found in practice that well-implemented exact algorithms are often competitive with approximate samplers like the Gibbs sampler [li_fast_2016]. The great challenge for both types of algorithms lies in scaling in $\mu$: it is easy enough to sample from very large datasets ($N$ in the millions), but because of the quadratic scaling in $\mu$ one is limited to taking small samples ($\mu$ in the hundreds). Because increasing $\mu$ also means increasing the rank of the $L$-matrix, one ends up paying a double penalty: when forming the $L$-ensemble, and during sampling.

One solution would be to formulate a sparse $L$-ensemble, in which case alg. 3 carries over in sparse matrix algebra. Another, perhaps simpler approach is to subdivide the dataset ahead of time and use DPPs in each subset. We plan to investigate these possibilities in future work.