 # Exact sampling of determinantal point processes with sublinear time preprocessing

We study the complexity of sampling from a distribution over all index subsets of the set {1,...,n} with the probability of a subset S proportional to the determinant of the submatrix L_S of some n × n p.s.d. matrix L, where L_S corresponds to the entries of L indexed by S. Known as a determinantal point process, this distribution is used in machine learning to induce diversity in subset selection. In practice, we often wish to sample multiple subsets S with small expected size k = E[|S|] ≪ n from a very large matrix L, so it is important to minimize the preprocessing cost of the procedure (performed once) as well as the sampling cost (performed repeatedly). For this purpose, we propose a new algorithm which, given access to L, samples exactly from a determinantal point process while satisfying the following two properties: (1) its preprocessing cost is n · poly(k), i.e., sublinear in the size of L, and (2) its sampling cost is poly(k), i.e., independent of the size of L. Prior to our results, state-of-the-art exact samplers required O(n^3) preprocessing time and sampling time linear in n or dependent on the spectral properties of L. We also give a reduction which allows using our algorithm for exact sampling from cardinality constrained determinantal point processes with n · poly(k) time preprocessing.


## 1 Introduction

Given an n × n positive semi-definite (psd) matrix L, a determinantal point process DPP(L) (also known as an L-ensemble) is a distribution over all index subsets S ⊆ {1,...,n} such that

Pr(S) ≜ det(L_S) / det(I + L),

where L_S denotes the submatrix of L with rows and columns indexed by S. Determinantal point processes naturally appear across many scientific domains [Mac75, BLMV17, Gue83], while also being used as a tool in machine learning and recommender systems [KT12] for inducing diversity in subset selection and as a variance reduction approach. In this context, we often wish to efficiently produce many DPP samples of small expected size k = E[|S|],¹ given a large matrix L. Sometimes the distribution is restricted to subsets of fixed size k, denoted k-DPP(L). [HKP06] gave an algorithm for drawing samples distributed exactly according to DPP(L), later adapted to k-DPP(L) by [KT11], which can be implemented to run in polynomial time. In many applications, however, sampling is still a computational bottleneck because the algorithm requires an eigendecomposition of the matrix L, at O(n³) cost. In addition to that initial cost, producing many independent samples at high frequency poses a challenge because the cost of each sample is at least linear in n. Many alternative algorithms have been considered for both DPPs and k-DPPs to reduce the computational cost of preprocessing and/or sampling, including many approximate and heuristic approaches. Contrary to approximate solutions, we present an algorithm which samples exactly from a DPP or a k-DPP with the initial preprocessing cost sublinear in the size of L and the sampling cost independent of the size of L.

¹ To avoid complicating the exposition with edge cases, we assume k ≥ 1. Note that this can always be satisfied without distorting the distribution by rescaling L by a constant, and it is without loss of generality, as our analysis can be trivially extended to the case k < 1 with some additional notation.
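To make the target distribution concrete, here is a brute-force sampler that directly instantiates Pr(S) = det(L_S)/det(I + L) by enumerating all 2^n subsets. This is an illustrative sketch only; it is emphatically not the algorithm of this paper, which avoids any such enumeration:

```python
import itertools

import numpy as np

def dpp_probabilities(L):
    """All subsets of {0,...,n-1} with their DPP(L) probabilities.

    Uses the identity sum_S det(L_S) = det(I + L) for normalization.
    """
    n = L.shape[0]
    Z = np.linalg.det(np.eye(n) + L)  # normalization constant det(I + L)
    subsets, probs = [], []
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            # the determinant of the empty submatrix is 1 by convention
            p = (np.linalg.det(L[np.ix_(S, S)]) if S else 1.0) / Z
            subsets.append(S)
            probs.append(p)
    return subsets, np.array(probs)

def sample_dpp_bruteforce(L, rng):
    subsets, probs = dpp_probabilities(L)
    return set(subsets[rng.choice(len(subsets), p=probs)])
```

The cost is exponential in n, so this only serves to pin down the distribution that DPP-VFX matches exactly at poly(k) per-sample cost.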

###### Theorem 1

For a psd matrix L, let S_1, S_2, … be i.i.d. random sets drawn from DPP(L) or from any k-DPP(L). Then, there is an algorithm which, given access to L, returns

1. the first subset, S_1, in: n · poly(k) time,

2. each subsequent S_j in: poly(k) time.

We refer to this algorithm as the very fast and exact DPP sampler, or DPP-VFX. Table 1 compares DPP-VFX with other DPP and k-DPP sampling algorithms. In this comparison, we focus on the methods that provide strong accuracy guarantees. As seen from the table, our algorithm is the first exact sampler to achieve sublinear overall runtime. Only the approximate MCMC sampler of [AGR16] matches our complexity (and only for a k-DPP), but for this method every next sample is equally expensive, making it less practical when repeated sampling is needed. In fact, to our knowledge, no other exact or approximate method (with rigorous approximation guarantees) achieves the per-sample time of this paper.

Our method is based on a technique developed recently by [DWH18, DWH19] and later extended by [Der19]. In this approach, we carefully downsample the index set {1,...,n} to an intermediate sample σ that is small but still sufficiently larger than the expected target size k, and then run a DPP on σ. As the downsampling distribution we use a regularized determinantal point process (R-DPP), proposed by [Der19], which (informally) samples a sequence σ with probability proportional to det(I + L̃_σ), where L̃ is a rescaled version of L. Overall, the approach is described in the diagram below:

{1,...,n}  --σ ∼ R-DPP-->  (σ_1,...,σ_t)  --S̃ ∼ DPP(L̃_σ)-->  S = {σ_i : i ∈ S̃}.

The DPP algorithm proposed by [Der19] follows the same diagram, however it requires that the size of the intermediate sample σ scale with the rank of L. This means that their method provides an improvement over [HKP06] only when L can be decomposed as L = B^⊤B for some r × n matrix B, with r ≪ n. However, in practice, matrix L is often only approximately low-rank, i.e., it exhibits some form of eigenvalue decay but it does not have an exact low-rank factorization. In this case, the results of [Der19] are vacuous both in terms of the preprocessing cost and the sampling cost, in that obtaining every sample would take time super-linear in n. We propose a different R-DPP implementation (see DPP-VFX as Algorithm 1) where the expected size of σ is poly(k), independent of the rank of L. To make the algorithm efficient, we use new connections between determinantal point processes, ridge leverage scores, and Nyström approximation.

###### Definition 1

Given a psd matrix L and λ > 0, the ith λ-ridge leverage score (RLS) of L is the ith diagonal entry τ_i(λ) of the matrix L(λI + L)^{−1}. The λ-effective dimension is the sum of the leverage scores, d_eff(λ) = Σ_i τ_i(λ).

An important connection between RLSs and DPPs is that when S ∼ DPP(L), the marginal probability of index i being sampled into S is equal to the ith 1-ridge leverage score of L, and the expected size of S is equal to the 1-effective dimension:

Pr(i ∈ S) = τ_i(1) = [L(I + L)^{−1}]_{ii},   and   E[|S|] = d_eff(1) = k.

Intuitively, if the marginal probability of i is high, then this index should likely make it into the intermediate sample σ. This suggests that i.i.d. sampling of the indices proportionally to the 1-ridge leverage scores, i.e. p_i ∝ τ_i(1), should serve as a reasonable and cheap heuristic for constructing σ. In fact, we can show that this distribution can be easily corrected by rejection sampling to become the R-DPP that we need. Computing ridge leverage scores exactly costs O(n³), so instead we compute them approximately by first constructing a Nyström approximation of L.
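For reference, the exact 1-ridge leverage scores of Definition 1 can be computed directly at the O(n³) cost mentioned above; a minimal sketch:

```python
import numpy as np

def ridge_leverage_scores(L, lam=1.0):
    """Return (tau, d_eff): tau[i] = [L (L + lam*I)^{-1}]_{ii} and
    d_eff = sum_i tau[i], as in Definition 1."""
    n = L.shape[0]
    # (L + lam*I)^{-1} L has the same diagonal as L (L + lam*I)^{-1}
    # because L is symmetric; solve a linear system instead of inverting.
    tau = np.diag(np.linalg.solve(L + lam * np.eye(n), L)).copy()
    return tau, tau.sum()
```

With lam = 1, the scores tau_i are exactly the DPP marginals Pr(i ∈ S) and d_eff(1) = E[|S|] = k, so the cheap proposal discussed above samples indices i.i.d. with p_i = tau_i / d_eff.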

###### Definition 2

Let L be an n × n psd matrix and C a subset of its row/column indices with size m. Then we define the Nyström approximation of L based on C as the n × n matrix L̂ ≜ L_{I,C} L_C^+ L_{C,I}.

Here, L_{C,I} denotes the m × n matrix consisting of the (entire) rows of L indexed by C, and (·)^+ denotes the Moore–Penrose pseudoinverse. Since we use rejection sampling to achieve the right intermediate distribution, the correctness of our algorithm does not depend on which Nyström approximation is chosen. However, the subset C greatly influences the computational cost of the sampling through the rank of L̂ and the probability of rejecting a sample. Since rank(L̂) ≤ m, operations such as multiplication and inversion involving the Nyström approximation will scale with m, and therefore a small subset increases efficiency. However, if L̂ is too different from L, the probability of rejecting the sample will be very high and the algorithm inefficient. In this case, a slightly larger subset could improve accuracy and the acceptance rate without increasing too much the cost of handling L̂. Therefore, the subset C has to be selected so that it is both small and accurately represents the matrix L. Here, we once again rely on ridge leverage score sampling, which has been effectively used for obtaining good Nyström approximations in a number of prior works such as [AM15, CLV17, RCCR18].
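Definition 2 translates directly into a few lines of linear algebra. The sketch below also shows the RLS-based choice of C discussed above, using exact (rather than the paper's fast approximate) leverage scores for simplicity:

```python
import numpy as np

def nystrom(L, C):
    """Nystrom approximation of L based on index subset C (Definition 2):
    L_hat = L[:, C] @ pinv(L[C, C]) @ L[C, :]."""
    C = np.asarray(C)
    L_IC = L[:, C]                 # n x m block of columns indexed by C
    L_C = L[np.ix_(C, C)]          # m x m core submatrix
    return L_IC @ np.linalg.pinv(L_C) @ L_IC.T

def rls_nystrom(L, m, rng, lam=1.0):
    """Pick C by i.i.d. sampling proportionally to the ridge leverage scores
    (computed exactly here for illustration)."""
    n = L.shape[0]
    tau = np.diag(np.linalg.solve(L + lam * np.eye(n), L))
    C = rng.choice(n, size=m, replace=True, p=tau / tau.sum())
    return nystrom(L, np.unique(C))
```

Two properties worth noting: the residual L − L̂ is always psd, and rank(L̂) ≤ m, which is what makes all downstream operations scale with m rather than n.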

While our main algorithm can sample only from the random-size DPP, and not from the fixed-size k-DPP, we present a rigorous reduction argument which lets us use our DPP algorithm to sample exactly from a k-DPP (for any k) with a small computational overhead.

#### Related work

Prior to our work, fast exact sampling from generic DPPs has been considered out of reach. The first procedure to sample general DPPs was given by [HKP06], and even the most recent exact refinements [LGD18, Der19, Pou19], when the DPP is represented in the form of an L-ensemble, require preprocessing that amounts to an expensive matrix diagonalization at O(n³) cost, which is shown in the first-sample complexity column of Table 1.

Nonetheless, there are well-known samplers for very specific DPPs that are both fast and exact, for instance for sampling uniform spanning trees [Ald90, Bro89, PW98], which leaves open the possibility of a more generic fast sampler. Since sampling from DPPs has several practical large-scale machine learning applications [KT12], there are now a number of methods known to be able to sample from a DPP approximately, outlined in the following paragraphs.

As DPPs can be specified by kernels (L-kernels or K-kernels), a natural approximation strategy is to resort to low-rank approximations [KT11, GKT12, AKFT13, LJS16a]. For example, [AKFT13] provide an approximation guarantee for the probability of any subset being sampled, as a function of the eigengaps of the L-kernel. Next, [LJS16a] construct coresets approximating a given k-DPP and then use them for sampling. In their Section 4.1, [LJS16a] show in which cases we can hope for a good approximation. These guarantees become tight if the approximations (Nyström subspace, coresets) are aligned with the data. In our work, we aim for an adaptive approach that is able to provide a good approximation for any DPP.

The second class of approaches is based on Markov chain Monte Carlo [MU49] techniques [Kan13, RK15, AGR16, LJS16b, GBV17]. There are known polynomial bounds on the mixing rates [DS91] of MCMC chains with arbitrary DPPs as their limiting measure. In particular, [AGR16] showed them for cardinality-constrained DPPs and [LJS16b] for the general case. The two chains have mixing times which are, respectively, linear and quadratic in n (see Table 1). Unfortunately, for any subsequent sample we need to wait until the chain mixes again.

Neither the known low-rank approximations nor the known MCMC methods are able to provide samples that are exactly distributed (also called perfect sampling) according to a DPP. This is not surprising, as having scalable and exact sampling is very challenging in general. For example, methods based on rejection sampling are always exact, but they typically do not scale with the dimension and are adversely affected by spikes in the distribution [EVCM16], resulting in a high rejection rate and inefficiency. Surprisingly, our method is based on both low-rank approximation (a source of inaccuracy) and rejection sampling (a common source of inefficiency). In the following section, we show how to obtain a perfect DPP sampler from a Nyström approximation of the L-kernel. Then, to guarantee efficiency, in Section 3 we bound the number of rejections, which is possible thanks to the use of intermediate downsampling.

## 2 Exact sampling using any Nyström approximation

#### Notation

We use [n] to denote the set {1,...,n}. For a matrix B and index sets C, D, we use B_{C,D} to denote the submatrix of B consisting of the intersection of the rows indexed by C with the columns indexed by D. If C = D, we use the shorthand B_C, and if D = [n], we may write B_{C,I}. Finally, we also allow C and D to be multisets (or sequences), in which case each entry is rescaled by the multiplicities of its row and column indices as follows: the entry at the intersection of the ith row with the jth column becomes √(m_i m_j) · B_{ij}, where m_i is the multiplicity of i in C and m_j is the multiplicity of j in D. Such a rescaling has the property that if B is psd, then so is B_C.

As discussed in the introduction, our method relies on an intermediate downsampling distribution to reduce the size of the problem. The exactness of our sampler relies on the careful choice of that intermediate distribution. To that end, we use regularized determinantal processes, introduced by [Der19]. In the definition below, we adapt them to the kernel setting.

###### Definition 3

Given an n × n psd matrix L, a distribution p = (p_1,...,p_n) over [n], and r > 0, let L̃ denote the n × n matrix with entries L̃_{ij} = L_{ij}/(r√(p_i p_j)) for all i, j. We define R-DPP_p^r(L) as a distribution over events A, where

Pr(A) = E_σ[ 1_{[σ∈A]} det(I + L̃_σ) ] / det(I + L),   for σ = (σ_1,…,σ_t) i.i.d. ∼ p, t ∼ Poisson(r).

Since the term det(I + L̃_σ) has the same form as the normalization constant of DPP(L̃_σ), an easy calculation shows that the R-DPP can be used as an intermediate distribution in our algorithm without introducing any distortion in the sampling.

###### Proposition 1 (Der19, Theorem 8)

For any psd matrix L, distribution p, r > 0, and L̃ defined as in Definition 3,

if σ ∼ R-DPP_p^r(L) and S̃ ∼ DPP(L̃_σ), then {σ_i : i ∈ S̃} ∼ DPP(L).

To sample from the R-DPP, DPP-VFX uses rejection sampling, where the proposal distribution draws indices i.i.d. proportionally to approximate 1-ridge leverage scores (see Definition 1 and the following discussion), computed using any Nyström approximation L̂ of the matrix L. Apart from L̂, the algorithm also requires an additional parameter q, which controls the size of the intermediate sample. Because of rejection sampling and Proposition 1, the correctness of the algorithm does not depend on the choice of L̂ and q, as demonstrated in the following result. The key part of the proof involves showing that the acceptance probability in Line 4 is bounded by 1. Here, we obtain a considerably tighter bound than the one achieved by [Der19], which allows us to use a much smaller intermediate sample (see Section 3) while maintaining the efficiency of rejection sampling.

###### Theorem 2

Given a psd matrix L, any one of its Nyström approximations L̂, and any positive q, DPP-VFX returns S ∼ DPP(L).

Proof  We start by showing that the Bernoulli probability in Line 4 is bounded by 1. Note that this is important not only to sample correctly, but also when we later establish the efficiency of the algorithm. If we showed a weaker upper bound, say some c > 1, we could always divide the expression by c and retain correctness, however it would also be c times less likely that the sample gets accepted.

Since L̂ is a Nyström approximation based on some subset C, it can be written as

L̂ = L_{I,C} L_C^+ L_{C,I} = B B_{C,I}^⊤ L_C^+ B_{C,I} B^⊤ = B P B^⊤,

for any B such that L = BB^⊤, where P ≜ B_{C,I}^⊤ L_C^+ B_{C,I} is a projection (so that P² = P). Let B̃ be the t × n matrix whose ith row is the rescaled σ_ith row of B, i.e., b̃_i = (1/√(q p_{σ_i})) b_{σ_i}. Then, we have

det(I + L̃_σ) / det(I + L̂)
  = det(I + B̃^⊤B̃) / det(I + PB^⊤BP) = det( (I + B̃^⊤B̃)(I + PB^⊤BP)^{−1} )
  ≤ exp( tr( (B̃^⊤B̃ − PB^⊤BP)(I + PB^⊤BP)^{−1} ) )
  = exp( Σ_{i=1}^t (s/(q l_{σ_i})) [ B(I + PB^⊤BP)^{−1}B^⊤ ]_{σ_i σ_i} ) · e^{−z}
  = e^{ts/q} · e^{−z},

where the last equality follows because

B(I + PB^⊤BP)^{−1}B^⊤ = B( I − PB^⊤(I + BPB^⊤)^{−1}BP )B^⊤ = L − L̂(I + L̂)^{−1}L̂ = L − L̂ + L̂(I + L̂)^{−1}.
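This chain of equalities is a Woodbury-style manipulation: expanding (I + PB^⊤BP)^{−1} and using P² = P collapses the middle terms. It can be verified numerically for a random factorization L = BB^⊤ (a sanity-check sketch, not part of the algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 6, 3
B = rng.standard_normal((n, n))
L = B @ B.T
I = np.eye(n)

# P projects onto the span of the m rows of B indexed by a Nystrom subset C.
BC = B[:m, :]
P = BC.T @ np.linalg.pinv(BC @ BC.T) @ BC      # symmetric, satisfies P @ P == P
L_hat = B @ P @ B.T                            # the Nystrom approximation B P B^T

lhs = B @ np.linalg.inv(I + P @ B.T @ B @ P) @ B.T
rhs = L - L_hat + L_hat @ np.linalg.inv(I + L_hat)
assert np.allclose(lhs, rhs)
```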

Thus, we showed that the expression in Line 4 is a valid probability. Let σ̃ denote the random variable distributed as σ is after exiting the repeat loop. It follows that

Pr(σ̃ ∈ A) ∝ E_σ[ 1_{[σ∈A]} e^z det(I + L̃_σ) / (e^{ts/q} det(I + L̂)) ]
  ∝ Σ_{t=0}^∞ ( (q e^{s/q})^t / (e^{q e^{s/q}} t!) ) · e^{−ts/q} E_σ[ 1_{[σ∈A]} det(I + L̃_σ) | t ]
  ∝ E_{t'}[ E_σ[ 1_{[σ∈A]} det(I + L̃_σ) | t = t' ] ],   for t' ∼ Poisson(q),

which shows that σ̃ ∼ R-DPP_p^q(L) for p_i = l_i/s. The claim follows from Proposition 1.
Even though the choice of L̂ affects the overall execution of the algorithm, it does not affect the output, so we can reuse the same L̂ to produce multiple independent samples (proof in Appendix A).

###### Lemma 1

Let C be a random set variable with any distribution. Suppose that S_1 and S_2 are returned by two executions of DPP-VFX, both using inputs constructed from the same C and L. Then S_1 and S_2 are (unconditionally) independent.

## 3 Conditions for fast sampling

The complexity of DPP-VFX can be roughly summarized as follows: we pay a large one-time cost to precompute L̂ and all its associated quantities, and then we pay a smaller cost in the rejection sampling scheme, which must be multiplied by the number of times we repeat the loop until acceptance. We first show that if the sum of the approximate RLS (i.e., Σ_i l_i, denoted by s) is sufficiently close to k, then we will exit the loop with high probability. We then analyze how accurate the precomputing step needs to be to satisfy this condition.

Bounding the number of rejections.  The following result presents the two conditions needed for achieving efficient rejection sampling in DPP-VFX. First, the Nyström approximation L̂ needs to be accurate enough, and second, the intermediate sample size (controlled by the parameter q) needs to be of order k². This is a significant improvement over the guarantee of [Der19], where the intermediate sample size scales with the rank of L, which is only meaningful for low-rank kernels. The main novelty in this proof comes from showing the following lower bound on the ratio of the determinants of I + L and I + L̂, when L̂ is a Nyström approximation: det(I + L)/det(I + L̂) ≥ e^{k−z}, where z ≜ tr(L̂(I + L̂)^{−1}). Remarkably, this bound exploits the fact that any Nyström approximation of L can be written as L̂ = BPB^⊤, where P is a projection matrix. Note that while our result holds in the worst case, in practice the conditions on L̂ and on q can be considerably relaxed.

###### Theorem 3

If the Nyström approximation L̂ and the intermediate sample size parameter q satisfy

tr( L(I + L)^{−1}L − L̂(I + L̂)^{−1}L̂ ) = s − k ≤ 1   and   q ≥ max{s², s},

then the acceptance probability satisfies P ≥ e^{−2}. Therefore, with probability 1 − δ, Algorithm 1 exits the rejection sampling loop after at most O(log(1/δ)) iterations and, after precomputing all of the inputs, the time complexity of the rejection sampling loop is poly(k) · log(1/δ).

Proof  Let σ be distributed as in Line 3. The probability of exiting the repeat loop at each iteration is

P ≜ E_σ[ e^z det(I + L̃_σ) / (e^{ts/q} det(I + L̂)) ]
  = Σ_{t=0}^∞ ( (q e^{s/q})^t / (e^{q e^{s/q}} t!) ) · (e^{z − ts/q} / det(I + L̂)) · E_σ[ det(I + L̃_σ) | t ]
  = e^{z + q − q e^{s/q}} · det(I + L) / det(I + L̂),

where the last equality follows because the infinite series computes the normalization constant of R-DPP_p^q(L) given in Definition 3. If s ≥ 1, then q ≥ s² implies s/q ≤ 1, and the inequality e^x ≤ 1 + x + x² for x ≤ 1 implies that q − q e^{s/q} ≥ −s − s²/q ≥ −s − 1. On the other hand, if s < 1, then q ≥ s yields the same bound. We proceed to lower bound the determinantal ratio. Here, let L = BB^⊤ and L̂ = BPB^⊤, where P is a projection matrix. Then,

det(I + L) / det(I + L̂)
  = det(I + B^⊤B) / det(I + PB^⊤BP) = det( I − (B^⊤B − PB^⊤BP)(I + B^⊤B)^{−1} )^{−1}
  ≥ exp( tr( (B^⊤B − PB^⊤BP)(I + B^⊤B)^{−1} ) )
  = exp( tr( B(I + B^⊤B)^{−1}B^⊤ ) − tr( BP(I + B^⊤B)^{−1}PB^⊤ ) )
  = exp( k − tr( BP(I + B^⊤B)^{−1}PB^⊤ ) )
  = exp( k − tr( BP(P + PB^⊤BP)^{−1}PB^⊤ ) ),

where k = tr(L(I + L)^{−1}) by definition, and we pushed P inside the inverse. Moreover, if we denote z ≜ tr(L̂(I + L̂)^{−1}), we have

z = tr( L̂(I + L̂)^{−1} ) = tr( BP(P + PB^⊤BP)^{−1}PB^⊤ ),

where again we push P inside the inverse. Putting it all together, det(I + L)/det(I + L̂) ≥ e^{k−z}.

We can now lower bound the acceptance probability: P ≥ e^{z+q−qe^{s/q}} · e^{k−z} ≥ e^{k−s−1} ≥ e^{−2}. Here, crucially, the two terms compensated for one another by eliminating z from the bound, which would otherwise be difficult to control. Thus, with probability 1 − δ, the main loop will be repeated at most O(log(1/δ)) times. Since the number of samples t drawn from p in each iteration of the loop is a Poisson distributed random variable, a standard Poisson tail bound implies that with probability 1 − δ all iterations will satisfy t = O(q log(1/δ)). Since the dominant cost of each iteration is computing the determinant of the t × t matrix I + L̃_σ in O(t³) time, and the final step, S̃ ∼ DPP(L̃_σ), requires its eigendecomposition (with the same time complexity), the result follows.
Bounding the precompute cost.  All that is left now is to control the cost of the precomputation phase. We separate the analysis into two steps: how much it costs to choose C so as to satisfy the condition s − k ≤ 1, and how much it costs to compute everything else given L̂ (see Appendix A).

###### Lemma 2

Let C be constructed by sampling m columns of L proportionally to their RLS. Then, with probability 1 − δ, the resulting L̂ satisfies the condition of Theorem 3, i.e., s − k ≤ 1.

There exist many algorithms to sample columns proportionally to their RLS. For example, we can take the BLESS algorithm from [RCCR18] with the following guarantee.

###### Proposition 2 (RCCR18, Theorem 1)

There exists an algorithm that, with probability 1 − δ, samples m columns proportionally to their RLS in n · poly(k) time.

We can now compute the remaining preprocessing costs, given a Nyström approximation .

###### Lemma 3

Given L̂ with rank m, we can compute the approximate RLS l_i, their sum s, the constant z, and all remaining quantities required by DPP-VFX in O(nm²) time.

We are finally ready to combine these results to fully characterize the computational cost.

###### Theorem 1 (restated for DPPs only)

For a psd matrix L, let S_1, S_2, … be i.i.d. random sets from DPP(L). Denote by L̂ a Nyström approximation of L obtained by sampling m of its columns proportionally to their RLS. Then, with probability 1 − δ, DPP-VFX returns

1. the subset S_1 in: n · poly(k) time,

2. then, each of S_2, S_3, … in: poly(k) time.

Discussion.  Due to the nature of rejection sampling, as long as we exit the loop, i.e., we accept the sample, the output of DPP-VFX is guaranteed to follow the DPP distribution for any value of q. In Theorem 1 we set q to satisfy Theorem 3 and guarantee a constant acceptance probability in the rejection sampling loop, but this might not be necessary or even desirable in practice. Experimentally, much smaller values of q seem to be sufficient to accept the sample, while at the same time a smaller q greatly reduces the preprocessing costs. In general, we recommend separating DPP-VFX into three phases. First, compute an accurate estimate of the RLS using off-the-shelf algorithms. Then, sample a small number m of columns to construct an explorative L̂, and try to run DPP-VFX. If the rejection sampling loop does not terminate sufficiently fast, we can reuse the RLS estimates to compute a more accurate L̂ for a larger m. Using a simple doubling schedule for m, this procedure quickly reaches a regime where DPP-VFX is guaranteed to accept w.h.p., maintaining its asymptotic complexity while at the same time resulting in faster sampling in practice.

## 4 Reduction from DPPs to k-DPPs

We next show that, with a simple extra rejection sampling step, we can efficiently transform any exact DPP sampler into an exact k-DPP sampler.

A common heuristic to sample from a k-DPP is to first sample S from a DPP, and then reject the sample if the size of S is not exactly k. The success probability of this procedure can be improved by appropriately rescaling L by a constant factor α,

sample S_α ∼ DPP(αL),   accept if |S_α| = k.

Note that rescaling L by a constant only changes the distribution of the size of the set S_α, and not the conditional distribution of S_α given its size. Therefore, if we accept only sets with size k, we will be sampling exactly from our k-DPP. Moreover, if E[|S_α|] is close to k, the success probability will improve. With a slight abuse of notation, in the context of k-DPPs we will indicate with k the desired size of S, i.e., the final output size at acceptance, and with E[|S_α|] the expected size of the scaled DPP.
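The rescale-and-accept scheme can be sketched end-to-end. For illustration we pair it with a brute-force exact DPP sampler over small n; the paper would instead plug in DPP-VFX, and the remainder of this section shows how to choose α efficiently:

```python
import itertools

import numpy as np

def sample_dpp(L, rng):
    """Exact DPP(L) sampler by subset enumeration (O(2^n), illustration only)."""
    n = L.shape[0]
    subsets = [S for r in range(n + 1) for S in itertools.combinations(range(n), r)]
    dets = np.array([np.linalg.det(L[np.ix_(S, S)]) if S else 1.0 for S in subsets])
    # Pr(S) = det(L_S) / det(I + L)
    return subsets[rng.choice(len(subsets), p=dets / np.linalg.det(np.eye(n) + L))]

def sample_kdpp(L, k, alpha, rng, max_tries=10_000):
    """Sample S_alpha ~ DPP(alpha * L) and accept iff |S_alpha| = k.
    Conditioned on acceptance, the output is distributed exactly as k-DPP(L)."""
    for _ in range(max_tries):
        S = sample_dpp(alpha * L, rng)
        if len(S) == k:
            return set(S)
    raise RuntimeError("no acceptance; alpha is poorly tuned for this k")
```

The inner sampler can be any exact DPP sampler; the choice of alpha only affects how often the size check fails, never the conditional distribution of the accepted set.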

While this rejection sampling heuristic is widespread in the literature, until now there was no proof that the process can be provably made efficient. We solve this open question with two new results. First, we show that for an appropriate rescaling α* we only reject roughly √k times in expectation. Then, we show how to find such an α* with an n · poly(k) time preprocessing step.

###### Theorem 4

There exists a constant c > 0 such that for any rank-r psd matrix L and any k ≤ r, there is an α* > 0 with the following property: if we sample S_{α*} ∼ DPP(α*L), then Pr(|S_{α*}| = k) ≥ c/√k.

The proof (see Appendix B) relies on a known Chernoff bound for the sample size of a DPP. When applied naïvely, the inequality does not offer a lower bound on the probability of any single sample size, however we can show that the probability mass is concentrated on O(√k) sizes. This leads to a lower bound on the sample size with the largest probability, i.e., the mode of the distribution. Then, it remains to observe that for any k we can always find an α* for which k is the mode of |S_{α*}|. We conclude that, given α*, the rejection sampling scheme described above transforms any exact DPP sampler into an exact k-DPP sampler at the cost of roughly a √k-factor overhead in the sampling time. It remains to find α* efficiently, which once again relies on using a Nyström approximation of L.

###### Lemma 4

If k ≤ rank(L), then there is an algorithm that finds α* in n · poly(k) time.

Figure 1: First-sample cost for DPP-VFX against other DPP samplers (n is the data size). Figure 2: Resampling cost for DPP-VFX compared to the exact sampler of [HKP06].

## 5 Experiments

In this section, we experimentally evaluate the performance of DPP-VFX compared to exact sampling [HKP06] and MCMC-based approaches [AGR16]. In particular, since Section 2 proves that DPP-VFX samples exactly from the DPP, we are interested in evaluating computational performance. This is characterized by showing how DPP-VFX and the baselines scale with the size n of the matrix L when taking a first sample, and how DPP-VFX achieves constant resampling time.

To construct L, we use random subsets of the infinite MNIST digits dataset [LCB07], varying n up to the size of the full dataset, and we use an RBF kernel. All algorithms are implemented in python. For exact and MCMC sampling we used the DPPy library [GBV18], while for DPP-VFX we reimplemented BLESS [RCCR18] and used DPPy to perform exact sampling on the intermediate subset. All experiments are carried out on a 24-core CPU and fully take advantage of potential parallelization. While the number of columns we use for the Nyström approximation is much lower than the value suggested by the theory, as we will see it is already accurate enough to result in drastic runtime improvements over exact and MCMC sampling. For each algorithm we control² the size of the output set by rescaling the input matrix by a constant, following the strategy of Section 4. In Figure 1 we report our results, means and 95% confidence intervals over 10 runs, for subsets of MNIST that go up to the whole original MNIST dataset.

² For simplicity we do not perform the full k-DPP rejection step, but only adjust the expected size of the set.

Exact sampling is clearly cubic in n, and we cannot push it beyond moderate values of n. For MCMC, we enforce mixing by running the chain for the minimum number of steps recommended by [AGR16]. However, for the largest n, the MCMC runtime is too large to be included in the plot, while DPP-VFX completes an order of magnitude faster. Moreover, DPP-VFX rarely rejects more than 10 times, and the mode of the number of rejections is 0, that is, we mostly accept at the first iteration. Figure 2 reports the cost of the second sample, i.e., of resampling. For exact sampling, this means that an eigendecomposition of L is already available, but as the plot shows, the resampling process still scales with n. On the other hand, DPP-VFX's complexity (after preprocessing) scales only with k and remains constant regardless of n.

Finally, we scaled DPP-VFX to a dataset size where neither exact nor MCMC approaches are feasible. We report runtime and average rejections, with mean and 95% confidence interval over 5 runs: DPP-VFX draws its first sample in a matter of seconds, with only a handful of rejections.

### Acknowledgements

MD thanks the NSF for funding via the NSF TRIPODS program.

## References

• [AGR16] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte carlo markov chain algorithms for sampling strongly rayleigh distributions and determinantal point processes. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 103–115, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
• [AKFT13] Raja Hafiz Affandi, Alex Kulesza, Emily Fox, and Ben Taskar. Nystrom approximation for large-scale determinantal processes. In Carlos M. Carvalho and Pradeep Ravikumar, editors, Proceedings of the Sixteenth International Conference on Artificial Intelligence and Statistics, volume 31 of Proceedings of Machine Learning Research, pages 85–98, Scottsdale, Arizona, USA, 29 Apr–01 May 2013. PMLR.
• [Ald90] David J Aldous. The random walk construction of uniform spanning trees and uniform labelled trees. SIAM Journal on Discrete Mathematics, 3(4):450–465, 1990.
• [AM15] Ahmed El Alaoui and Michael W. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 775–783, Montreal, Canada, December 2015.
• [BLMV17] Rémi Bardenet, Frédéric Lavancier, Xavier Mary, and Aurélien Vasseur. On a few statistical applications of determinantal point processes. ESAIM: Procs, 60:180–202, 2017.
• [Bra14] Petter Branden. Unimodality, log-concavity, real-rootedness and beyond. Handbook of Enumerative Combinatorics, 10 2014.
• [Bro89] A Broder. Generating random spanning trees. In Foundations of Computer Science, 1989., 30th Annual Symposium on, pages 442–447. IEEE, 1989.
• [CCL19] Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, and Lorenzo Rosasco. Gaussian process optimization with adaptive sketching: Scalable and no regret. In Conference on Learning Theory, 2019.
• [CLV17] Daniele Calandriello, Alessandro Lazaric, and Michal Valko. Distributed adaptive sampling for kernel matrix approximation. In AISTATS, 2017.
• [D64] John N Darroch et al. On the distribution of the number of successes in independent trials. The Annals of Mathematical Statistics, 35(3):1317–1321, 1964.
• [Der19] Michał Dereziński. Fast determinantal point processes via distortion-free intermediate sampling. In Proceedings of the 32nd Conference on Learning Theory, 2019.
• [DS91] Persi Diaconis and Daniel Stroock. Geometric Bounds for Eigenvalues of Markov Chains. The Annals of Applied Probability, 1991.
• [DWH18] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu. Leveraged volume sampling for linear regression. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2510–2519. Curran Associates, Inc., 2018.
• [DWH19] Michał Dereziński, Manfred K. Warmuth, and Daniel Hsu. Correcting the bias in least squares regression with volume-rescaled sampling. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, 2019.
• [EVCM16] Akram Erraqabi, Michal Valko, Alexandra Carpentier, and Odalric-Ambrym Maillard. Pliable rejection sampling. In International Conference on Machine Learning, 2016.
• [GBV17] Guillaume Gautier, Rémi Bardenet, and Michal Valko. Zonotope hit-and-run for efficient sampling from projection DPPs. In International Conference on Machine Learning, 2017.
• [GBV18] Guillaume Gautier, Rémi Bardenet, and Michal Valko. DPPy: Sampling determinantal point processes with Python. https://arxiv.org/abs/1809.07258, 2018.
• [GKT12] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Discovering diverse and salient threads in document collections. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, EMNLP-CoNLL ’12, pages 710–720, Stroudsburg, PA, USA, 2012. Association for Computational Linguistics.
• [Gue83] A Guenoche. Random spanning tree. Journal of Algorithms, 4(3):214 – 220, 1983.
• [H56] Wassily Hoeffding et al. On the distribution of the number of successes in independent trials. The Annals of Mathematical Statistics, 27(3):713–721, 1956.
• [HKP06] J Ben Hough, Manjunath Krishnapur, Yuval Peres, Bálint Virág, et al. Determinantal processes and independence. Probability surveys, 3:206–229, 2006.
• [Kan13] Byungkon Kang. Fast determinantal point process sampling with application to clustering. In Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, pages 2319–2327, USA, 2013.
• [KT11] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-Size Determinantal Point Processes. In Proceedings of the 28th International Conference on Machine Learning, pages 1193–1200, Bellevue, WA, USA, June 2011.
• [KT12] Alex Kulesza and Ben Taskar. Determinantal Point Processes for Machine Learning. Now Publishers Inc., Hanover, MA, USA, 2012.
• [LCB07] Gaëlle Loosli, Stéphane Canu, and Léon Bottou. Training invariant support vector machines using selective sampling. In Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston, editors, Large Scale Kernel Machines, pages 301–320. MIT Press, Cambridge, MA., 2007.
• [LGD18] Claire Launay, Bruno Galerne, and Agnès Desolneux. Exact Sampling of Determinantal Point Processes without Eigendecomposition. arXiv e-prints, page arXiv:1802.08429, Feb 2018.
• [LJS16a] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Efficient sampling for k-determinantal point processes. In Arthur Gretton and Christian C. Robert, editors, Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, volume 51 of Proceedings of Machine Learning Research, pages 1328–1337, Cadiz, Spain, 09–11 May 2016. PMLR.
• [LJS16b] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, pages 4195–4203, USA, 2016. Curran Associates Inc.
• [Mac75] Odile Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7(1):83–122, 1975.
• [MU49] Nicholas Metropolis and S. Ulam. The Monte Carlo method. Journal of the American Statistical Association, 44(247):335–341, 1949.
• [Pou19] Jack Poulson. High-performance sampling of generic determinantal point processes. arXiv preprint arXiv:1905.00165, 2019.
• [PP14] Robin Pemantle and Yuval Peres. Concentration of Lipschitz functionals of determinantal and other strong Rayleigh measures. Combinatorics, Probability and Computing, 23(1):140–160, 2014.
• [PW98] J G Propp and D B Wilson. How to get a perfectly random sample from a generic Markov chain and generate a random spanning tree of a directed graph. Journal of Algorithms, 27(2):170–217, 1998.
• [RCCR18] Alessandro Rudi, Daniele Calandriello, Luigi Carratino, and Lorenzo Rosasco. On fast leverage score sampling and optimal learning. In Advances in Neural Information Processing Systems 31, pages 5672–5682. 2018.
• [RK15] P Rebeschini and A Karbasi. Fast mixing for discrete point processes. In Conference on Learning Theory, pages 1480–1500, 2015.

## Appendix A Omitted proofs for the main algorithm

In this section we present the proofs omitted from Sections 2 and 3, concerning the correctness and efficiency of DPP-VFX. We start by showing that multiple samples drawn using the same Nyström approximation are independent.

###### Lemma 5 (restated Lemma 1)

Let C be a random set variable with any distribution. Suppose that S_1 and S_2 are returned by two executions of DPP-VFX, both using inputs constructed from the same L and C. Then S_1 and S_2 are (unconditionally) independent.

Proof  Let A and B be two subsets of [n] representing elementary events for S_1 and C, respectively. Theorem 2 implies that

 Pr(S_1 = A ∣ C = B) = det(L_A) / det(I + L) = Pr(S_1 = A).

Now, for any A_1, A_2 ⊆ [n] representing elementary events for S_1 and S_2, we have that

 Pr(S_1 = A_1 ∧ S_2 = A_2)
  = ∑_{B ⊆ [n]} Pr(S_1 = A_1 ∧ S_2 = A_2 ∣ C = B) Pr(C = B)
  = ∑_{B ⊆ [n]} Pr(S_1 = A_1 ∣ C = B) Pr(S_2 = A_2 ∣ C = B) Pr(C = B)
  = Pr(S_1 = A_1) Pr(S_2 = A_2) ∑_{B ⊆ [n]} Pr(C = B).

Since ∑_{B ⊆ [n]} Pr(C = B) = 1, we get that S_1 and S_2 are independent.
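The marginalization argument above can be illustrated with a toy numeric check. This is not from the paper: the distributions pC and pS below are hypothetical stand-ins for the distribution of the Nyström set C and the (B-independent) conditional of a sample given C.

```python
# A toy numeric illustration (not from the paper) of the marginalization
# step in Lemma 5: if Pr(S1 = A | C = B) does not depend on B, and S1, S2
# are conditionally independent given C, then S1 and S2 are independent.
outcomes_S = ["A1", "A2"]        # hypothetical elementary events for S1, S2
outcomes_C = ["B1", "B2", "B3"]  # hypothetical Nystrom sets C

pC = {"B1": 0.2, "B2": 0.5, "B3": 0.3}   # arbitrary distribution of C
pS = {"A1": 0.7, "A2": 0.3}              # Pr(S = A | C = B), same for all B

# Joint of (S1, S2) obtained by summing over C, using conditional independence.
joint = {
    (a1, a2): sum(pS[a1] * pS[a2] * pC[b] for b in outcomes_C)
    for a1 in outcomes_S
    for a2 in outcomes_S
}

# Since the pC terms sum to 1, the joint factorizes into the marginals.
for (a1, a2), p in joint.items():
    assert abs(p - pS[a1] * pS[a2]) < 1e-12
```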
We now bound the preprocessing cost, starting with the construction of the Nyström approximation ˆL.

###### Lemma 6 (restated Lemma 2)

Let ˆL be constructed by sampling columns of L proportionally to their ridge leverage scores (RLS). Then, with probability 1 − δ, ˆL satisfies the condition of Theorem 3, i.e., the required bound on s.

Proof  Let B = L^{1/2} and ˆL = B P B⊤ (where P is a projection matrix). Using algebraic manipulation, we can write

 s = tr(L − ˆL + ˆL(I + ˆL)^{−1}) = tr(B(P B⊤B P + I)^{−1} B).

The matrix in the above expression has recently been analyzed by [CCL19] in the context of RLS sampling, who gave the following result that we use in the proof.

###### Proposition 3 ([CCL19], Lemma 6)

Let the projection matrix P be constructed by sampling columns proportionally to their RLS. Then,

 (1 − ε)(B⊤B + I) ⪯ P B⊤B P + I ⪯ (1 + ε)(B⊤B + I).

It is easy to see that applying the construction from Proposition 3 leads to the following bound on s:

 s = tr(B(P B⊤B P + I)^{−1} B) ≤ (1/(1 − ε)) · tr(B(B⊤B + I)^{−1} B) = (1/(1 − ε)) · k = k + (ε/(1 − ε)) · k.

Tuning ε, we obtain the claimed bound on s, and reordering gives us the desired accuracy result.
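The algebraic manipulation at the start of the proof can be verified numerically. The sketch below (not the authors' code) assumes, as in the proof, B = L^{1/2} and ˆL = B P B⊤ for an orthogonal projection P, and also checks that the P = I case of the trace recovers k = tr(L(I + L)^{−1}).

```python
import numpy as np

# Sanity check of the identity
#   s = tr(L - Lhat + Lhat (I + Lhat)^{-1}) = tr(B (P B^T B P + I)^{-1} B),
# assuming B = L^{1/2} and Lhat = B P B^T for an orthogonal projection P.
rng = np.random.default_rng(0)
n, m = 8, 3

A = rng.standard_normal((n, n))
L = A @ A.T                                   # random PSD matrix
# Symmetric square root B = L^{1/2} via eigendecomposition.
w, V = np.linalg.eigh(L)
B = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T

# Orthogonal projection P onto a random m-dimensional subspace.
Q, _ = np.linalg.qr(rng.standard_normal((n, m)))
P = Q @ Q.T

Lhat = B @ P @ B.T
I = np.eye(n)
s_direct = np.trace(L - Lhat + Lhat @ np.linalg.inv(I + Lhat))
s_rewritten = np.trace(B @ np.linalg.inv(P @ B.T @ B @ P + I) @ B)
assert np.isclose(s_direct, s_rewritten)

# With P = I the same trace is k = tr(L (I + L)^{-1}), matching the bound.
k = np.trace(L @ np.linalg.inv(I + L))
assert np.isclose(np.trace(B @ np.linalg.inv(B.T @ B + I) @ B), k)
```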
Finally, we show how to compute the remaining quantities needed for DPP-VFX (Algorithm 1).

###### Lemma 7 (restated Lemma 3)

Given L and an arbitrary Nyström approximation ˆL of rank m, computing the marginals l_i, their sum s, det(I + ˆL), and (I + ˆL_m)^{−1/2} requires O(nm² + m³) time.

Proof  Given the Nyström set C, let us define the matrix B̄ ∈ R^{n×m} such that ˆL = B̄B̄⊤. We also introduce ˆL_m ≜ B̄⊤B̄ to act as a counterpart to ˆL. Denote with e_i the i-th indicator vector. Then, exploiting the fact that B̄B̄⊤(I + B̄B̄⊤)^{−1} = B̄(I + B̄⊤B̄)^{−1}B̄⊤ for any matrix B̄, we can compute l_i as

 l_i = [L − ˆL + B̄B̄⊤(I + B̄B̄⊤)^{−1}]_{ii} = [L − ˆL]_{ii} + ∥(I + ˆL_m)^{−1/2} B̄⊤ e_i∥².

Computationally, this means that we first need to compute (I + ˆL_m)^{−1/2} B̄⊤, which takes O(m³) time to compute (I + ˆL_m)^{−1/2}, and O(nm²) time for the matrix multiplication. Then, ∥(I + ˆL_m)^{−1/2} B̄⊤ e_i∥² is the squared norm of the i-th row of B̄(I + ˆL_m)^{−1/2}, which can be computed in O(m) time. Similarly, [L − ˆL]_{ii} requires O(m) time. To compute s we simply sum the l_i, while to compute det(I + ˆL) we first compute the eigenvalues {λ_i} of ˆL_m in O(m³) time, and then compute det(I + ˆL) = ∏_i (1 + λ_i). We can also recycle the eigenvalues to precompute (I + ˆL_m)^{−1/2}.
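The computation above can be sketched in a few lines of numpy. This is an illustration, not the authors' implementation: the Nyström set C is picked uniformly here, and B̄ is built as L_{:,C} (L_C)^{−1/2}, one standard way to obtain ˆL = B̄B̄⊤. The efficient route is compared against the naive O(n³) formula for the marginals.

```python
import numpy as np

# Sketch: computing the marginals l_i from a rank-m Nystrom approximation,
# compared against the naive formula l_i = [L - Lhat + Lhat(I + Lhat)^{-1}]_ii.
rng = np.random.default_rng(1)
n, m = 10, 4

A = rng.standard_normal((n, n))
L = A @ A.T                                   # random PSD matrix
C = rng.choice(n, size=m, replace=False)      # hypothetical Nystrom set

# Bbar with Lhat = Bbar Bbar^T, built from the inverse square root of L_C.
w, U = np.linalg.eigh(L[np.ix_(C, C)])
w = np.clip(w, 1e-12, None)
Bbar = L[:, C] @ U @ np.diag(w ** -0.5) @ U.T  # n x m
Lhat = Bbar @ Bbar.T
Lm = Bbar.T @ Bbar                             # m x m counterpart of Lhat

# Efficient route: O(n m^2) overall after the O(m^3) eigendecomposition.
lam, W = np.linalg.eigh(Lm)
inv_half = W @ np.diag((1 + lam) ** -0.5) @ W.T   # (I + Lm)^{-1/2}
R = Bbar @ inv_half                               # rows give the squared norms
l_fast = np.diag(L - Lhat) + np.sum(R ** 2, axis=1)

# Naive route for comparison: O(n^3).
M = L - Lhat + Lhat @ np.linalg.inv(np.eye(n) + Lhat)
l_naive = np.diag(M)
assert np.allclose(l_fast, l_naive)

# The eigenvalues of Lm are recycled for det(I + Lhat).
assert np.isclose(np.prod(1 + lam), np.linalg.det(np.eye(n) + Lhat))
```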

## Appendix B Omitted proofs for the reduction to k-DPPs

In this section we present the proofs omitted from Section 4. Recall that our approach is based on the following rejection sampling strategy:

 sample S_α ∼ DPP(αL), accept if |S_α| = k.

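Since the size |S_α| of a sample from DPP(αL) is distributed as a sum of independent Bernoullis with means αλ_i/(1 + αλ_i), where λ_i are the eigenvalues of L [HKP06], the acceptance probability of this strategy can be simulated without running a full DPP sampler. The sketch below (not the authors' implementation) picks α so that the expected size equals k, one natural choice, via bisection on the monotone map α ↦ ∑_i αλ_i/(1 + αλ_i).

```python
import numpy as np

# Estimate Pr(|S_alpha| = k) for the rejection step, using the fact that
# |S_alpha| is a sum of independent Bernoullis with means
# alpha*lam_i / (1 + alpha*lam_i) [HKP06].
rng = np.random.default_rng(2)
n, k = 50, 5

A = rng.standard_normal((n, n))
lam = np.clip(np.linalg.eigvalsh(A @ A.T), 0.0, None)  # eigenvalues of PSD L

def acceptance_rate(alpha, trials=2000):
    """Monte Carlo estimate of Pr(|S_alpha| = k) from the Bernoulli sizes."""
    p = alpha * lam / (1 + alpha * lam)
    sizes = (rng.random((trials, n)) < p).sum(axis=1)
    return np.mean(sizes == k)

# Pick alpha with E[|S_alpha|] = sum(p) = k by geometric bisection.
lo, hi = 1e-9, 1e9
for _ in range(200):
    mid = np.sqrt(lo * hi)
    if np.sum(mid * lam / (1 + mid * lam)) < k:
        lo = mid
    else:
        hi = mid
alpha = lo

# With this alpha, a non-negligible fraction of proposals has |S_alpha| = k.
assert acceptance_rate(alpha) > 0
```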
First, we show the existence of a rescaling factor α for which the rejection sampling is efficient.

###### Theorem 5 (restated Theorem 4)

There exists a constant c such that for any rank-r PSD matrix L and any k ≤ r, there is α > 0 with the following property: if we sample