Determinantal point processes (DPPs) form a family of distributions for sampling diverse subsets of points from a given domain, where diversity is measured by the squared volume of the parallelepiped spanned by the points in some predefined space (such as ). DPPs have found applications as a model for repulsive sampling in a variety of fields such as physics, statistics, machine learning, computational geometry, graph theory and others. Here we mention a few such applications as examples:
Let be a tall and thin matrix, i.e. . A determinantal point process is defined as a distribution over all subsets such that
where denotes the submatrix of containing rows indexed by . DPP algorithms are typically divided into a preprocessing step, which needs to be performed once per given matrix , and a sampling step, which happens each time we wish to sample set . In this paper, we improve on the best known time complexity of both steps by demonstrating:
the first DPP algorithm s.t. preprocessing takes input sparsity time: ;
the first exact DPP algorithm s.t. the sampling step takes time, independent of .
Here, denotes the number of non-zero entries. Before this work, the best known DPP algorithms (see Section 2 and references therein) had preprocessing times , or if we allow conditioning on a fixed subset size (which can be as large as ), and sampling times at least
. Our DPP algorithm is based on the following general recipe for sampling from some “target” joint distribution: first, generate a larger sample of rows from an “intermediate” distribution; then, downsample it to a smaller subset using the “target” distribution. Crucially, the intermediate sampling is not allowed to distort the target distribution:
A simplified version of this approach was recently suggested by . They sample from a different but related family of determinantal distributions called size volume sampling, which has the unique property that it can play the role of both the “intermediate” and the “target” distribution. DPPs, on the other hand, do not have that property, and no candidate for an intermediate distribution was previously known. To that end, we develop a new family of joint distributions: regularized determinantal point processes (R-DPPs). For a p.s.d. matrix and a Poisson mean parameter, an R-DPP samples over all sequences so that
Rescaling with Poisson probabilities is essential for the normalization constant of this distribution to have a closed form (a key part of our analysis). We obtain it via the following new determinantal formula: if is a sequence of i.i.d. samples from , then
). Our result shows that regularization can be introduced by randomizing the sequence length with a Poisson distribution. In Section 4 we use this formula to show a number of connections between R-DPPs and DPPs. For example, the latter can be obtained from the former when the Poisson mean parameter converges to 0 along with the regularization set to :
This means that the family of R-DPPs contains DPPs in its closure and can thus be viewed as an extension. We also show that DPPs are preserved under subsampling with an R-DPP:
This suggests that the R-DPP is a good candidate for an “intermediate” distribution when sampling a DPP. To make sampling from R-DPPs efficient, we further generalize them so that the marginal row probabilities can be reweighted with an arbitrary i.i.d. distribution. At a very high level, our strategy is to show that when the length of the sequence is sufficiently large in expectation (but still independent of ) and the R-DPP is reweighted by ridge leverage scores, then it becomes very close to i.i.d. sampling, so we can use that as a proposal distribution for a rejection sampling scheme.
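To make the rejection sampling step above concrete, the following sketch shows the generic pattern on a toy discrete distribution: i.i.d. proposals from q are accepted with probability p(i) / (M q(i)), which yields exact samples from p whenever p(i) ≤ M q(i) for all i. The numbers and names here are purely illustrative, not the algorithm's actual distributions.

```python
import numpy as np

def rejection_sample(target_p, proposal_q, M, rng):
    """Exact sampling from target_p using i.i.d. proposals from proposal_q,
    assuming target_p[i] <= M * proposal_q[i] for all i."""
    while True:
        i = rng.choice(len(proposal_q), p=proposal_q)
        # accept with the ratio that corrects the proposal toward the target
        if rng.random() < target_p[i] / (M * proposal_q[i]):
            return i

rng = np.random.default_rng(1)
target = np.array([0.5, 0.3, 0.2])
proposal = np.full(3, 1.0 / 3.0)      # i.i.d. uniform proposal
M = float(np.max(target / proposal))  # smallest valid envelope constant

samples = [rejection_sample(target, proposal, M, rng) for _ in range(20000)]
freq = np.bincount(samples, minlength=3) / len(samples)
assert np.allclose(freq, target, atol=0.02)  # empirical law matches target
```

The expected number of proposals per accepted sample is M, which is why the analysis later must ensure the acceptance probability stays bounded away from zero.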
In the following section we give some background and related work on DPP algorithms, then in Section 3 we present our main algorithm (Algorithm 2) and the associated result (Theorem 1), along with an example application in low-rank matrix reconstruction. Section 4 introduces R-DPPs along with their basic properties and the remaining Sections 5 and 6 are devoted to proving Theorem 1.
2 Background and related work
Several settings have been used in the literature for studying the
complexity of DPP algorithms. Below we review those which are slightly
different than the one we presented in Section 1, but are
still relevant to our discussion (for a more in-depth survey see ).
-ensemble We defined a DPP in terms of a matrix , where each element
is described by a row vector (this was suggested by  and ). An equivalent parameterization can be defined in terms of the so-called ensemble matrix , where the th entry represents the dot product . In this case, the probability (1) of a subset can be written as , where denotes the submatrix of with both row and column entries indexed by . Naturally, one representation can be converted to the other in preprocessing, so our results are still useful in the -ensemble case when . However, since matrix is much larger than , the preprocessing costs can differ considerably.
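The equivalence of the two parameterizations can be checked by brute force on a tiny instance. The sketch below assumes the row-based probabilities Pr(S) ∝ det(X_S X_S^T) with normalization det(I + X^T X), and the L-ensemble L = X X^T; both assumptions are consistent with the definitions above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2
X = rng.standard_normal((n, d))
L = X @ X.T                               # L-ensemble matrix (n x n)

# the normalization constant agrees under either parameterization
assert np.isclose(np.linalg.det(np.eye(d) + X.T @ X),
                  np.linalg.det(np.eye(n) + L))

Z = np.linalg.det(np.eye(d) + X.T @ X)
total = 0.0
for k in range(n + 1):
    for S in itertools.combinations(range(n), k):
        idx = list(S)
        XS = X[idx, :]
        # det(X_S X_S^T) equals the principal minor det(L_{S,S})
        assert np.isclose(np.linalg.det(XS @ XS.T),
                          np.linalg.det(L[np.ix_(idx, idx)]))
        total += np.linalg.det(XS @ XS.T) / Z
assert np.isclose(total, 1.0)             # probabilities sum to one
```

Note that forming L explicitly costs memory quadratic in the number of rows, which is precisely why the preprocessing costs of the two representations differ so much.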
k-DPP In some practical applications, when a subset of a particular size is desired, a DPP can be restricted to subsets such that for some , and is then referred to as a k-DPP . This is equivalent to sampling a set from a standard DPP, but accepting the sample only if . This distribution is also sometimes called size volume sampling (see, e.g., [15, 14]), not to be confused with the size volume sampling mentioned in Section 1. The special case of will be further discussed in Section 2.2. An alternative way of controlling the subset size is to rescale matrix with some so that the expected subset size for , i.e. , matches a desired value (see Section 3 for more details). A restriction to k-DPPs can lead to faster sampling algorithms (when is small), as discussed in the following sections.
) uses the singular value decomposition (SVD) of either or to produce an exact sample from the DPP in time . However, this runtime does not include the cost of the SVD, considered a preprocessing step, which takes and for and , respectively. Since then, a number of methods have been proposed to improve on this basic approach. We survey these techniques in the following two sections, and present a runtime comparison in Tables 1 and 2 (we omit the big-O notation in the tables).
2.1 Previous approximate preprocessing techniques
Approximate preprocessing methods were studied primarily for k-DPPs (i.e. ) rather than for standard DPPs, because bounding the subset size makes volume approximations easier to control. In [14, 18] it was suggested to use volume-preserving sketching of  to reduce the dimension of matrix to , which allows -approximate sampling from a k-DPP, i.e. for all . Here, the cost of sketching is , and it is followed by computing the SVD of an matrix in time . Also,  proposed to use a fast-mixing MCMC algorithm whose stationary distribution is a k-DPP, where the preprocessing cost is for matrix , and for matrix (the sampling time is similar). Other approximation techniques such as Nyström  and coresets  yield approximate DPP distributions whose accuracy is incomparable due to data dependence. Table 1 compares the preprocessing costs of the data-independent -approximate k-DPP methods with that of our DPP algorithm. Note that our approach is specifically designed for sampling from a full DPP, not a k-DPP. Also, unlike these -approximate methods, our algorithm samples from a proper determinantal process.
2.2 Sampling a DPP as a mixture of volume samples
The mixture decomposition shown by  implies that DPP sampling can be divided into two steps: first sample one element from the mixture, then generate a sample from that elementary DPP. Specifically, consider the eigendecompositions and
. For convenience, let us put the eigenvectors of into a matrix . Then can be produced by sampling a subset of eigenvector indices and then performing volume sampling w.r.t. the matrix constructed from those vectors:
Here, we used vectors from the decomposition of , but this can be easily obtained from the decomposition of during preprocessing, since . So given the eigendecomposition we can sample the set easily, and it remains to perform the volume sampling step.
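A minimal dense sketch of this two-step mixture sampler: eigendecompose, keep each eigenvector independently with probability λ_i / (1 + λ_i), then sample one row per kept eigenvector while projecting out each chosen row. The function name and the QR-based projection step are our own implementation choices, not the paper's pseudocode, and no attempt is made at the asymptotic runtimes discussed here.

```python
import numpy as np

def sample_dpp_hkpv(X, rng):
    """Sample a set S from DPP(X) via the eigendecomposition mixture.

    Step 1: eigendecompose X^T X; keep eigenvector i independently
            with probability lam_i / (1 + lam_i)   (mixture step).
    Step 2: for each kept direction, pick a row with probability
            proportional to its squared row norm in V, then restrict
            the span to vectors vanishing on that row (elementary DPP).
    """
    n, d = X.shape
    lam, U = np.linalg.eigh(X.T @ X)            # X^T X = U diag(lam) U^T
    keep = rng.random(d) < lam / (1.0 + lam)
    if not np.any(keep):
        return set()
    V = (X @ U[:, keep]) / np.sqrt(lam[keep])   # orthonormal columns (n x k)
    S = []
    while V.shape[1] > 0:
        k = V.shape[1]
        p = np.sum(V**2, axis=1)
        p /= p.sum()                            # row-norm probabilities
        i = rng.choice(n, p=p)
        S.append(i)
        # orthonormal basis of col(V) restricted to vectors with zero
        # i-th coordinate: complete c = V[i]/|V[i]| to a basis via QR
        c = V[i] / np.linalg.norm(V[i])
        Q, _ = np.linalg.qr(np.concatenate([c[:, None], np.eye(k)], axis=1))
        V = V @ Q[:, 1:]                        # Q[:, 0] is +-c
    return set(S)

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 2))
S = sample_dpp_hkpv(X, rng)
assert all(0 <= i < 4 for i in S) and len(S) <= 2
```

Each iteration of the inner loop costs time linear in the number of rows, which is the bottleneck that the intermediate-sampling approach of this paper removes.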
In Table 2 we review the running times of different approaches to sampling (compared to an MCMC-based DPP sampler  and our algorithm). The classical approach from the DPP literature [22, 5, 26] samples “bottom-up”, adding one point at a time and at each step projecting the remaining points onto the subspace orthogonal to that point (see Algorithm 1). Curiously, a diametrically opposed “top-down” approach of , which eliminates points one at a time instead of adding them, achieves the same asymptotic runtime with high probability. The volume sampling algorithm of  further improves on this downsampling strategy by introducing an intermediate i.i.d. oversampling step, which is what inspired our approach. Note that we are the first to apply the algorithms of [11, 13] to DPP sampling (they were previously known only in the context of volume sampling). Unfortunately, in all of these methods the sampling time is linear in (they have to read the matrix for each sampled set), which may not be acceptable when and we need to perform the sampling repeatedly. (A different strategy was proposed in , which uses coreset construction to approximately sample from a DPP in time independent of ; however, the sampling accuracy is data-dependent, and so incomparable.) In contrast, our algorithm, which uses the mixture decomposition only as a subroutine, samples in time independent of .
3 Main result
As discussed in Section 1, the high-level strategy of our algorithm is to design an “intermediate” sampling distribution, which reduces the size of the matrix from to while preserving the “target” distribution (here, a DPP). Another key property of the intermediate distribution is that it has to be close to i.i.d., so that we can implement it efficiently. Thus, our algorithm will use an i.i.d. sampling distribution defined by a vector of importance weights assigned to each row of , and use it as a proposal for rejection sampling from the intermediate distribution. Overall, the three main components of our method are:
In Section 4 we define a regularized determinantal point process (R-DPP) and use it as the intermediate distribution. For the i.i.d. sampling weights we use a particular variant of so-called ridge leverage scores . Let denote the th row of and w.l.o.g. assume that all rows are non-zero.
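One common variant of ridge leverage scores (the precise variant used by the algorithm may differ from this one) assigns to row x_i the weight τ_i = x_i^T (X^T X + λI)^{-1} x_i. A direct computation, done here without forming the inverse explicitly:

```python
import numpy as np

def ridge_leverage_scores(X, lam):
    """tau_i = x_i^T (X^T X + lam*I)^{-1} x_i for each row x_i of X.

    This is one standard definition; regularizer choice is illustrative."""
    n, d = X.shape
    G = X.T @ X + lam * np.eye(d)
    T = np.linalg.solve(G, X.T)           # d x n, avoids explicit inverse
    return np.einsum('ij,ji->i', X, T)    # diagonal of X G^{-1} X^T

rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))
tau = ridge_leverage_scores(X, 1.0)
# basic properties: each score lies in (0, 1), and they sum to
# sum_i lam_i / (lam_i + lam) <= d
assert np.all(tau > 0) and np.all(tau < 1)
assert tau.sum() < 5
```

Normalizing the τ_i gives an i.i.d. row-sampling distribution of exactly the kind the algorithm uses as its rejection-sampling proposal.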
Other than matrix , the algorithm takes additional inputs: a matrix and a sampling oracle for which approximates . The inputs and are computed in the preprocessing step, which can easily be performed in time , but in Section 6 we show how standard sketching techniques can be used to achieve input sparsity time preprocessing. In line 7 of the algorithm we invoke a different DPP sampling procedure for a matrix of reduced size. This can be, for example, the classical algorithm discussed in Section 2.2, with the volume sampling part implemented by Algorithm 1 and the SVD performed exactly. Our main result shows that Algorithm 2 samples correctly from a determinantal point process and runs in time independent of .
The only trade-off coming from approximate preprocessing is that the algorithm samples from instead of for some , which in fact has a closed form (see Lemma 6) but is more expensive to compute exactly. This rescaling only affects the distribution of the sample size (and not the relative probabilities of sets of equal size), making it a much stronger and more useful guarantee than typical approximate sampling, where every probability in the distribution can be arbitrarily distorted up to some factor.
The proof of Theorem 1 is spread out across Sections 4, 5 and 6. In Section 4 we define regularized determinantal point processes and show how they can be used for sampling DPPs. Then, in Section 5 we show that Algorithm 2 indeed samples from . Finally, we bound the time complexity of the algorithm and of the preprocessing in Section 6.
Application: row-based low-rank matrix reconstruction
One application of DPPs aims to find a small subset of rows of matrix such that the subspace spanned by those rows captures the full matrix nearly as well as the best rank- approximation in terms of the Frobenius norm . Let be the projection matrix onto the span of vectors . In  it was shown that if then for any ,
and that for any the bound is tight up to lower order terms (other methods are known for this and related tasks which achieve near-optimal bounds; e.g., see [15, 6, 7]). In particular, for the ratio to become , we need rows sampled from a DPP. Even though it is most natural to use fixed-size DPPs in this context, the sample size of a standard DPP is sufficiently concentrated around its mean to offer near-optimal guarantees for this task. In fact, as shown by  (see the tail bound of Theorem 3.5), for , if then w.p. we have for some absolute constant . The expected sample size can be derived as:
Since is monotonic w.r.t. the eigenvalues we can use a simple binary search to find a rescaling for which . Thus, if , then for , where , we have
We can obtain the same guarantee if we replace the eigenvalues of in (4) with those of a matrix satisfying condition (2), so by Theorem 1 the total cost of obtaining such a sample would be . (If we forgo exact DPP sampling, then projection-cost preserving sketches  may offer further speed-ups.) This raises the following natural question: can our techniques be extended to sampling from fixed-size DPPs, so that the optimal sample size can be achieved exactly and in time ? We leave this as a new direction for future work.
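The binary search over the rescaling mentioned above can be sketched as follows, using the fact that the expected sample size Σ_i αλ_i / (1 + αλ_i) is monotone increasing in the rescaling α (function and variable names are illustrative; it assumes the target size is below the number of nonzero eigenvalues):

```python
import numpy as np

def find_rescaling(eigs, k, tol=1e-10):
    """Binary search for alpha with sum(alpha*l / (1 + alpha*l)) = k.

    eigs: eigenvalues of X^T X; requires k < (number of nonzero eigs)."""
    def expected_size(a):
        return np.sum(a * eigs / (1.0 + a * eigs))
    lo, hi = 0.0, 1.0
    while expected_size(hi) < k:          # expand until target is bracketed
        hi *= 2.0
    while hi - lo > tol * hi:             # bisect on the monotone function
        mid = 0.5 * (lo + hi)
        if expected_size(mid) < k:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

eigs = np.array([4.0, 2.0, 0.5, 0.1])
alpha = find_rescaling(eigs, 2.0)
assert abs(np.sum(alpha * eigs / (1 + alpha * eigs)) - 2.0) < 1e-6
```

Each evaluation of the expected size is linear in the number of eigenvalues, so this search adds only a negligible (logarithmic in the tolerance) cost on top of the eigendecomposition.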
4 Regularized determinantal point processes (R-DPP)
We propose a new family of determinantal sampling distributions which will be used in Sections 5 and 6 to prove Theorem 1. The crucial property of this family is that the determinantal sampling probabilities can be regularized by adding an arbitrary fixed positive semi-definite (p.s.d.) matrix inside of the determinant, while maintaining many of the natural properties of a DPP, such as a simple normalization constant. This is achieved by controlling the size of the sample with a Poisson random variable. In the proofs that follow we will use the shorthand .
Given matrix , distribution , p.s.d. matrix and , we define as a distribution over all index sequences , s.t.
Whenever is uniform, we will write . Of course, we need to establish that this is in fact a valid distribution, i.e. that it sums to one. We achieve this by showing a new variant of the classical Cauchy-Binet formula (the classical formula is stated in (6) below).
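For reference, the classical Cauchy-Binet formula, det(X^T X) = Σ_{|S|=d} det(X_S)² with the sum over all d-element row subsets S, can be verified numerically on a small instance:

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
n, d = 6, 3
X = rng.standard_normal((n, d))

lhs = np.linalg.det(X.T @ X)
# sum of squared d x d minors over all d-element row subsets
rhs = sum(np.linalg.det(X[list(S), :]) ** 2
          for S in itertools.combinations(range(n), d))
assert abs(lhs - rhs) < 1e-8 * abs(lhs)
```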
Given and p.s.d. , if , then
Previously shown identities for fixed size do not generalize naturally to the regularized setting unless randomness in is introduced. We start the proof by applying the Cauchy-Binet formula (6) to the term under the expectation. Let be any decomposition of s.t. where . To apply the Cauchy-Binet formula, we sum over all -element subsets of the union of rows of matrices and :
Applying the law of total expectation w.r.t. the Poisson variable , we obtain
where follows from the exchangeability of sequence (so that the value of the expectation is the same for any subset ), in we expand the expectation and note that only unique sequences will have a non-zero determinant (hence the switch to subsets and the term ), in we absorb the factors and into the determinant, and finally is the classical Cauchy-Binet. It was crucial that we were able to absorb the subset size into the Poisson series, which allowed the formula to collapse to a single determinant. This completes the proof because .
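The identity can also be sanity-checked by Monte Carlo. Assuming (consistently with the proof above, though the stripped display is not recoverable verbatim) that it takes the form E[det(A + (1/r) Σ_{i≤K} x_{σ_i} x_{σ_i}^T / p_{σ_i})] = det(A + X^T X) for K ~ Poisson(r) and i.i.d. indices σ_i ~ p, the following toy check (dimensions, p and r chosen arbitrarily) agrees with the closed form up to sampling error:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, r = 3, 2, 4.0
X = rng.standard_normal((n, d)) * 0.5
A = np.diag([0.3, 0.7])                  # fixed p.s.d. regularizer
p = np.array([0.5, 0.3, 0.2])            # any fixed row distribution

target = np.linalg.det(A + X.T @ X)      # claimed closed form

total = 0.0
trials = 100_000
for _ in range(trials):
    K = rng.poisson(r)                   # randomized sequence length
    idx = rng.choice(n, size=K, p=p)     # i.i.d. row indices
    M = A.copy()
    for i in idx:
        M += np.outer(X[i], X[i]) / (r * p[i])
    total += np.linalg.det(M)

assert abs(total / trials - target) / target < 0.1
```

Replacing the Poisson length K by a deterministic one makes the check fail, illustrating why the Poisson randomization is essential for the normalization to collapse to a single determinant.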
In what sense is an R-DPP a natural extension of a determinantal point process? Naively, we might say that setting the matrix
to an all-zeros matrix would recover the classical distribution; however, this is not the case because when only samples of size or larger will have non-zero probability. Instead, we demonstrate a connection to both DPP and volume sampling distributions in a different way: we show that they can be obtained as the limiting distributions of an R-DPP when the regularization and the sample size parameter converge to zero (in two separate ways), which is remarkable as the two distributions in most cases produce vastly different samples.
For , ignoring the ordering in the sequences sampled with R-DPP, we have
Theorem 4 suggests that R-DPPs are likely to be of independent interest as an extension of DPPs. However, it does not say how to use them algorithmically. To that end, we make a second observation, which essentially states that DPPs are preserved under subsampling with R-DPPs.
For any , and distribution over s.t. , let denote matrix with th row rescaled by for every . It follows that for any ,
Using the law of total probability we compute the probability of sampling setof size :
where the cancellation in follows from Sylvester’s Theorem, and in we use the exchangeability of the sequence to observe that for any subset of size the value of the preceding sum is the same (the factor counts the number of such subsets and counts the number of sequences of length that correspond to set ).
5 Correctness of the main algorithm
Given and a non-zero p.s.d. matrix , Algorithm 2 returns , with where and .
The general idea of the proof is to show that the main repeat loop implements an R-DPP, so that we can invoke Theorem 5. Central to this fact is the choice of the numerical factor appearing in the denominator of the Bernoulli sampling probability in line 5, which we denote here as . This factor has to depend on , because otherwise the determinantal term will always dominate it for large enough . This means that needs to be carefully chosen so that:
the acceptance probability of line 5 is always bounded by 1,
we have control over how the presence of changes the distribution of ,
and is not too large so that we may have a good chance of accepting the sample.
For any , , and non-negative integers s.t. we have
Let be the matrix where for each the th row is rescaled by , with (same as in line 5 of the algorithm), and let
. We use the arithmetic-geometric mean inequality for the eigenvalues of matrix and the fact that , obtaining:
where follows from Lemma 7 invoked with and . Having established the validity of the rejection sampling in Algorithm 2, we now compute the distribution of sample at the point of exiting the repeat loop. Denoting as the desired Poisson mean parameter and as the normalization for the sampling probabilities in line 4,