Fast determinantal point processes via distortion-free intermediate sampling

11/08/2018 · by Michal Derezinski, et al.

Given a fixed n × d matrix X, where n ≫ d, we study the complexity of sampling from a distribution over all subsets of rows where the probability of a subset is proportional to the squared volume of the parallelepiped spanned by the rows (a.k.a. a determinantal point process). In this task, it is important to minimize the preprocessing cost of the procedure (performed once) as well as the sampling cost (performed repeatedly). To that end, we propose a new determinantal point process algorithm which has the following two properties, both of which are novel: (1) a preprocessing step which runs in time O(number-of-non-zeros(X) · log n) + poly(d), and (2) a sampling step which runs in poly(d) time, independent of the number of rows n. We achieve this by introducing a new regularized determinantal point process (R-DPP), which serves as an intermediate distribution in the sampling procedure by reducing the number of rows from n to poly(d). Crucially, this intermediate distribution does not distort the probabilities of the target sample. Our key novelty in defining the R-DPP is the use of a Poisson random variable for controlling the probabilities of different subset sizes, leading to new determinantal formulas such as the normalization constant for this distribution. Our algorithm has applications in many diverse areas where determinantal point processes have been used, such as machine learning, stochastic optimization, data summarization and low-rank matrix reconstruction.


1 Introduction

Determinantal point processes (DPPs) form a family of distributions for sampling diverse subsets of points from a given domain, where the diversity is measured by the squared volume of the parallelepiped spanned by the points in some predefined space (such as ℝ^d). DPPs have found applications as a model for repulsive sampling in a variety of fields such as physics [29], statistics [5], machine learning [26], computational geometry [15], graph theory [21] and others. Here we mention a few such applications as examples:

  1. Data summarization and diverse recommendation [18, 19, 17, 28]: selecting a representative sample of items, e.g. documents in a corpus, frames in a video or products in an online store.

  2. Row-based low-rank matrix reconstruction [15, 14, 20]: determinantal sampling is the optimal way of selecting a small number of rows of a matrix that preserve its low-rank approximation.

  3. Stochastic optimization and Monte Carlo sampling [36, 4]: DPP sampling has been used to reduce the variance inherent in i.i.d. sampling and improve convergence.

Let X ∈ ℝ^{n×d} be a tall and thin matrix, i.e. n ≫ d. The determinantal point process DPP(X) is defined as a distribution over all subsets S ⊆ {1, …, n} such that

    Pr(S)  ∝  det(X_S X_S^⊤),        (1)

where X_S denotes the submatrix of X containing the rows indexed by S. DPP algorithms are typically divided into a preprocessing step, which needs to be performed once per given matrix X, and a sampling step, which happens each time we wish to sample a set S. In this paper, we improve on the best known time complexity of both steps by demonstrating:

  1. the first DPP algorithm s.t. preprocessing takes input sparsity time: O(nnz(X) · log n) + poly(d);

  2. the first exact DPP algorithm s.t. the sampling step takes poly(d) time, independent of n.

Here, nnz(X) denotes the number of non-zero entries of X. Before this work, the best known DPP algorithms (see Section 2 and references therein) had preprocessing times of at least n·d² (the cost of an exact SVD), or costs depending on a fixed subset size if we allow conditioning on that size (which can be as large as d), and sampling times at least linear in n. Our DPP algorithm is based on the following general recipe for sampling from some “target” joint distribution:

first, generate a larger sample of rows from an “intermediate” distribution; then, downsample it to a smaller subset using the “target” distribution. Crucially, the intermediate sampling is not allowed to distort the target distribution: the subset obtained after downsampling must be distributed exactly according to the target DPP.
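To make the target distribution (1) concrete, the following NumPy sketch (our own illustration; not the fast algorithm of this paper) samples from a DPP on a tiny matrix by enumerating all subsets and normalizing their squared volumes:

    import itertools
    import numpy as np

    def brute_force_dpp_sample(X, rng=np.random.default_rng()):
        """Sample S with Pr(S) proportional to det(X_S X_S^T).

        Enumerates all 2^n subsets, so it is only usable for tiny n,
        but it makes the definition of the target distribution explicit.
        """
        n = X.shape[0]
        subsets, weights = [], []
        for k in range(n + 1):
            for S in itertools.combinations(range(n), k):
                XS = X[list(S), :]
                # det of the empty 0x0 Gram matrix is 1, so the empty set has positive mass
                w = 1.0 if k == 0 else max(np.linalg.det(XS @ XS.T), 0.0)
                subsets.append(S)
                weights.append(w)
        probs = np.array(weights) / np.sum(weights)
        return subsets[rng.choice(len(subsets), p=probs)]

    # Example: X = np.random.default_rng(0).standard_normal((8, 3)); print(brute_force_dpp_sample(X))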

A simplified version of this approach was recently suggested by [13]. They sample from a different but related family of determinantal distributions called volume sampling, which has the unique property that it can play the role of both the “intermediate” and the “target” distribution. DPPs, on the other hand, do not have that property, and no candidate for an intermediate distribution was previously known. To that end, we develop a new family of joint distributions: regularized determinantal point processes (R-DPPs). For a p.s.d. matrix A and a Poisson mean parameter r, an R-DPP samples over index sequences σ of varying length so that the probability of σ is proportional to a regularized determinant of the form det(c · X_σ^⊤X_σ + A), for an appropriate scaling c, weighted by the Poisson probability of the length |σ| (the precise statement is Definition 1 in Section 4).

Rescaling with Poisson probabilities is essential for the normalization constant of this distribution to have a closed form (a key part of our analysis). We obtain it via the following new determinantal formula: if σ_1, σ_2, … is a sequence of i.i.d. samples from a distribution p = (p_1, …, p_n) and K ~ Poisson(r), then

    E[ det( (1/r) ∑_{i=1}^{K} (1/p_{σ_i}) x_{σ_i} x_{σ_i}^⊤ + A ) ]  =  det( X^⊤X + A ),

where x_i^⊤ denotes the i-th row of X, and the expectations are over the random variable K as well as the i.i.d. samples. Similar formulas have been known for fixed sequence length, but only in the unregularized case where A = 0 (see [12, 35]). Our result shows that regularization can be introduced by randomizing the sequence length with a Poisson distribution. In Section 4 we use this formula to show a number of connections between R-DPPs and DPPs. For example, the latter can be obtained from the former when the Poisson mean parameter r converges to 0, with the regularization A chosen appropriately (see Theorem 4).

This means that the family of R-DPPs contains DPPs in its closure and can thus be viewed as an extension. We also show that DPPs are preserved under subsampling with an R-DPP (Theorem 5). This suggests that the R-DPP is a good candidate for an “intermediate” distribution when sampling a DPP. To make sampling from R-DPPs efficient, we further generalize them so that the marginal row probabilities can be reweighted with an arbitrary i.i.d. distribution. On a very high level, our strategy is to show that when the length of the sequence is sufficiently large in expectation (but still independent of n) and the R-DPP is reweighted by ridge leverage scores, then it becomes very close to i.i.d. sampling, so we can use i.i.d. sampling as a proposal distribution for a rejection sampling scheme.

In the following section we give some background and related work on DPP algorithms, then in Section 3 we present our main algorithm (Algorithm 2) and the associated result (Theorem 1), along with an example application in low-rank matrix reconstruction. Section 4 introduces R-DPPs along with their basic properties and the remaining Sections 5 and 6 are devoted to proving Theorem 1.

2 Background and related work

Several settings have been used in the literature for studying the complexity of DPP algorithms. Below we review those which are slightly different than the one we presented in Section 1, but are still relevant to our discussion (for a more in-depth survey see [26]).
L-ensemble  We defined a DPP in terms of a matrix X ∈ ℝ^{n×d}, where each element i is described by a row vector x_i (this was suggested by [15] and [24]). An equivalent parameterization can be defined in terms of the so-called L-ensemble matrix L = XX^⊤, where the (i, j)-th entry represents the dot product x_i^⊤x_j. In this case, the probability (1) of a subset S can be written as Pr(S) ∝ det(L_{S,S}), where L_{S,S} denotes the submatrix of L with both rows and columns indexed by S. Naturally, one representation can be converted to the other in preprocessing, so our results are still useful in the L-ensemble case when n ≫ d. However, since matrix L is much larger than X, the preprocessing costs can differ considerably.
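Since L_{S,S} = X_S X_S^⊤, both parameterizations assign exactly the same probability to every subset, as the following two-line NumPy check (our own illustration) confirms:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.standard_normal((6, 3))
    L = X @ X.T                      # L-ensemble kernel
    S = [1, 3, 4]
    # the principal submatrix determinant equals the squared volume spanned by the rows X_S
    assert np.isclose(np.linalg.det(L[np.ix_(S, S)]), np.linalg.det(X[S] @ X[S].T))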
k-DPP  In some practical applications, when a subset of a particular size is desired, a DPP can be restricted only to subsets S such that |S| = k for some k, and is then referred to as a k-DPP [25]. This is equivalent to sampling a set S from a standard DPP, but accepting the sample only if |S| = k. This distribution is also sometimes called volume sampling (see, e.g., [15, 14]), not to be confused with the variant of volume sampling mentioned in Section 1. The special case k = d will be further discussed in Section 2.2. An alternative way of controlling the subset size is by rescaling the matrix X with some α > 0 so that the expected subset size for the rescaled DPP matches a desired value (see Section 3 for more details). A restriction to a k-DPP can lead to faster sampling algorithms (when k is small), as discussed in the following sections.
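This rejection-based view of a k-DPP can be written directly on top of the brute-force sampler sketched in Section 1 (again only for tiny matrices; the helper names are hypothetical):

    import numpy as np

    def brute_force_kdpp_sample(X, k, rng=np.random.default_rng()):
        """Sample from the k-DPP by drawing from the DPP until the sample has size k.

        Assumes brute_force_dpp_sample from Section 1 is in scope.
        """
        while True:
            S = brute_force_dpp_sample(X, rng)
            if len(S) == k:
                return S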

A classical DPP algorithm introduced by [22] (see also [5, 26]) uses the singular value decomposition (SVD) of either X or L to produce an exact sample from the DPP in time that grows linearly with n (see Table 2). However, this runtime does not include the cost of the SVD, considered as a preprocessing step, which takes O(n·d²) and O(n³) time for X and L, respectively. Since then, a number of methods were proposed to improve on this basic approach. We survey these techniques in the following two sections, and present a runtime comparison in Tables 1 and 2.

Table 1: Preprocessing costs for approximate k-DPPs (omitting log terms), compared to our DPP algorithm and to SVD. Methods compared: SVD; sketching [30]; MCMC [4]; this paper.

Table 2: Sampling cost (after preprocessing) for DPP/k-DPP methods, compared to our DPP algorithm. Methods compared: bottom-up [22]; i.i.d.+top-down [11, 13]; MCMC [4]; this paper.

2.1 Previous approximate preprocessing techniques

Approximate preprocessing methods were studied primarily for k-DPPs (i.e. with a fixed subset size k) rather than for standard DPPs, because bounding the subset size makes volume approximations easier to control. In [14, 18] it was suggested to use the volume-preserving sketching of [30] to reduce the dimension of the matrix, which allows approximate sampling from a k-DPP, i.e. every size-k subset retains its probability up to a (1 ± ε) factor. Here, the cost of sketching is followed by computing an SVD of the reduced matrix. Also, [4] proposed to use a fast-mixing MCMC algorithm whose stationary distribution is a k-DPP, with preprocessing costs that differ depending on whether the input is given as X or as L (sampling time is similar). Other approximation techniques such as Nyström [1] and coresets [27] yield approximate DPP distributions whose accuracy is incomparable due to data dependence. Table 1 compares the preprocessing costs of the data-independent (1 ± ε)-approximate k-DPP methods with that of our DPP algorithm. Note that our approach is specifically designed for sampling from a full DPP, not a k-DPP. Also, unlike these approximate methods, our algorithm samples from a proper determinantal process.

2.2 Sampling a DPP as a mixture of volume samples

Any DPP is a mixture of so-called elementary DPPs [22, 26]. These elementary DPPs have been independently studied in the context of volume sampling [3, 11, 13]. Given a matrix V ∈ ℝ^{n×k} with orthonormal columns, we define volume sampling as the distribution over size-k subsets S with Pr(S) ∝ det(V_S)², where V_S is the k×k submatrix of the rows of V indexed by S.

The mixture decomposition shown by [22] implies that DPP sampling can be divided into two steps: first sample one element from the mixture, then generate a sample from that elementary DPP. Specifically, consider the eigendecompositions of XX^⊤ and X^⊤X, which share the non-zero eigenvalues λ_1, …, λ_d. For convenience, let us put the selected eigenvectors of XX^⊤ into a matrix V. Then S ~ DPP(X) can be produced by sampling a subset of eigenvector indices (each index i is included independently with probability λ_i/(λ_i + 1)) and then performing volume sampling w.r.t. the matrix V constructed from those eigenvectors. Here, we used eigenvectors from the decomposition of XX^⊤, but these can easily be obtained from the decomposition of X^⊤X during preprocessing, since the i-th eigenvector of XX^⊤ equals X·u_i/√λ_i, where u_i is the corresponding eigenvector of X^⊤X. So given the eigendecomposition we can sample the index set easily, and it remains to perform the volume sampling step.
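For intuition, this first (mixture) step can be sketched in a few lines of NumPy (our own illustration, with hypothetical function names); the returned matrix V is then handed to the volume sampling step of Algorithm 1 below:

    import numpy as np

    def sample_elementary_index_set(X, rng=np.random.default_rng()):
        """Step 1 of the mixture decomposition for DPP(X): choose eigenvector indices
        independently, index i with probability lam_i / (lam_i + 1)."""
        lam, U = np.linalg.eigh(X.T @ X)                  # eigendecomposition of X^T X
        keep = rng.random(lam.shape[0]) < lam / (lam + 1.0)
        # corresponding eigenvectors of X X^T, obtained as X u_i / sqrt(lam_i)
        V = (X @ U[:, keep]) / np.sqrt(lam[keep])
        return V                                          # orthonormal columns; pass to Algorithm 1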

  input: V ∈ ℝ^{n×k} s.t. V^⊤V = I
  output: set S of size k
  for t = 1, …, k
   Sample i with probability proportional to ‖v_i‖², where v_i denotes the i-th row of V
   S ← S ∪ {i};  replace every row v_j by its projection onto the subspace orthogonal to v_i
  end for
  return S
Algorithm 1 Bottom-up volume sampling
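A NumPy rendering of this bottom-up procedure might look as follows (our own sketch, assuming V has exactly k orthonormal columns; not code from the paper):

    import numpy as np

    def volume_sample(V, rng=np.random.default_rng()):
        """Bottom-up sampler for Pr(S) = det(V_S)^2, where V (n x k) has orthonormal columns."""
        W = np.array(V, dtype=float)
        n, k = W.shape
        S = []
        for _ in range(k):
            sq = np.einsum('ij,ij->i', W, W)        # current squared row norms
            p = np.clip(sq, 0.0, None)
            i = rng.choice(n, p=p / p.sum())        # pick a row proportionally to its squared norm
            S.append(int(i))
            # project every row onto the subspace orthogonal to the selected row
            W = W - np.outer(W @ W[i] / sq[i], W[i])
        return sorted(S)

Each of the k iterations touches all n rows, so the total cost is O(n·k²), which is the "linear in n" sampling behavior referred to below.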

In Table 2 we review the running times for different approaches to this volume sampling step (compared to an MCMC-based DPP sampler [4] and our algorithm). The classical approach from the DPP literature [22, 5, 26] samples “bottom-up”, adding one point at a time and at each step projecting the remaining points onto the subspace orthogonal to that point (see Algorithm 1). Curiously, a diametrically opposed “top-down” approach of [11], which eliminates points one at a time instead of adding them, achieves the same asymptotic runtime with high probability. The volume sampling algorithm of [13] further improves on this downsampling strategy by introducing an intermediate i.i.d. oversampling step, which is what inspired our approach. Note that we are the first to apply the algorithms of [11, 13] to DPP sampling (they were previously known only in the context of volume sampling). Unfortunately, in all of these methods the sampling time is linear in n (they have to read the matrix X for each sampled set), which may not be acceptable when n is large and we need to perform the sampling repeatedly. (A different strategy was proposed in [27], which uses coreset construction to approximately sample from a DPP in time independent of n; however, the sampling accuracy is data-dependent, and so incomparable.) In contrast, our algorithm, which uses the mixture decomposition only as a subroutine, samples in time independent of n.

3 Main result

As discussed in Section 1, the high-level strategy of our algorithm is to design an “intermediate” sampling distribution which reduces the size of the matrix from n rows to poly(d) rows while preserving the “target” distribution (here, a DPP). Another key property of the intermediate distribution is that it has to be close to i.i.d., so that we can implement it efficiently. Thus, our algorithm will use an i.i.d. sampling distribution defined by a vector of importance weights assigned to each row of X, and use it as a proposal for rejection sampling from the intermediate distribution. Overall, the three main components of our method are the intermediate R-DPP distribution, the i.i.d. importance weights, and the rejection sampling scheme that connects them.

In Section 4 we define a regularized determinantal point process (R-DPP) and use it as the intermediate distribution. For the i.i.d. sampling weights we use a particular variant of so-called ridge leverage scores [2]. Let x_i^⊤ denote the i-th row of X, and w.l.o.g. assume that all rows are non-zero.

1:  input: X ∈ ℝ^{n×d},  Ã ≈ X^⊤X,  sampling oracle for approximate row weights q̃
2:  repeat
3:   sample k ~ Poisson(r),  the size of the intermediate sample
4:   sample σ = (σ_1, …, σ_k) i.i.d. from the row weights q  (rejection sampling via q̃)
5:   sample Acc ~ Bernoulli(·),  with acceptance probability given by a regularized determinant (see Section 5)
6:  until Acc = true  (if Acc = true, then σ is distributed as an R-DPP)
7:  return S = {σ_i : i ∈ S̃},  where S̃ ~ DPP(X̃_σ) is sampled from the reduced matrix of rescaled rows
Algorithm 2   Fast determinantal point process sampling

Other than the matrix X, the algorithm takes additional inputs: a matrix Ã and a sampling oracle for q̃, which approximate X^⊤X and the ridge leverage score distribution, respectively. The inputs Ã and q̃ are computed in the preprocessing step, which can be easily performed in O(n·d²) time, but in Section 6 we show how standard sketching techniques can be used to achieve input sparsity time preprocessing. In line 7 of the algorithm we invoke a different DPP sampling procedure for a matrix of reduced size. This can be, for example, the classical algorithm discussed in Section 2.2, with the volume sampling part implemented by Algorithm 1 and the SVD performed exactly. Our main result shows that Algorithm 2 samples correctly from a determinantal point process and runs in time independent of n.
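For intuition about these inputs, the exact ridge leverage scores l_i(λ) = x_i^⊤(X^⊤X + λI)^{-1}x_i of [2] can be computed in O(n·d²) time as in the sketch below (our own illustration; the paper uses a particular rescaled variant of these scores and reaches input sparsity time by approximating them via sketching):

    import numpy as np

    def ridge_leverage_scores(X, lam):
        """Exact ridge leverage scores l_i = x_i^T (X^T X + lam*I)^{-1} x_i."""
        d = X.shape[1]
        M = X.T @ X + lam * np.eye(d)
        # Solve M Z = X^T instead of forming an explicit inverse (more stable).
        Z = np.linalg.solve(M, X.T)          # shape (d, n), column i equals M^{-1} x_i
        return np.einsum('ij,ji->i', X, Z)   # row-wise x_i^T M^{-1} x_i

Normalized to sum to one, such scores give an i.i.d. proposal distribution over rows, which is the role the oracle q̃ plays in Algorithm 2.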

Theorem 1.

Given X ∈ ℝ^{n×d}, Ã and q̃ such that

(2)  Ã is a sufficiently accurate approximation of X^⊤X,
(3)  q̃ is a sufficiently accurate approximation of the ridge leverage score distribution,

Algorithm 2 returns S ~ DPP(√α·X) for some rescaling α > 0, and with high probability its time complexity is polynomial in d and in the expected subset size for DPP(X), independent of n. Given X, there is an algorithm that w.h.p. returns Ã and q̃ satisfying (2) and (3) in time O(nnz(X)·log n) + poly(d).

The only trade-off coming from approximate preprocessing in the algorithm is that it samples from DPP(√α·X) instead of DPP(X) for some α > 0, which in fact has a closed form (see Lemma 6) but is more expensive to compute exactly. This rescaling only affects the distribution of the sample size (and not the relative probabilities of sets of equal size), making it a much stronger and more useful guarantee than typical approximate sampling, where every probability in the distribution can be arbitrarily distorted up to some factor.

The proof of Theorem 1 is spread out across Sections 4, 5 and 6. In Section 4 we define regularized determinantal point processes and show how they can be used for sampling DPPs. Then, in Section 5 we show that Algorithm 2 indeed samples from a determinantal point process. Finally, we bound the time complexity of the algorithm and of the preprocessing in Section 6.

Application: row-based low-rank matrix reconstruction

One application of DPPs aims to find a small subset of rows of the matrix X such that the subspace spanned by those rows captures the full matrix nearly as well as the best rank-k approximation in terms of the Frobenius norm ‖·‖_F. Let P_S be the projection matrix onto the span of the rows {x_i : i ∈ S}. In [20] it was shown that if S is sampled from a DPP then, for any k, the expected error E[‖X − X·P_S‖_F²] is bounded by a multiplicative factor times the optimum ‖X − X_k‖_F², and that the bound is tight up to lower order terms. (Other methods are known for this and related tasks which achieve near-optimal bounds; e.g., see [15, 6, 7].) In particular, for the ratio to become 1 + ε, on the order of k/ε rows sampled from a DPP suffice. Even though it is most natural to use fixed-size DPPs in this context, the sample size for a standard DPP is sufficiently concentrated around its mean to offer near-optimal guarantees for this task. In fact, as shown by [33] (see the tail bound of Theorem 3.5), the sample size of a DPP is, with high probability, within a constant factor of its expectation. The expected sample size can be derived as:

    E[|S|]  =  ∑_{i=1}^{d} λ_i / (λ_i + 1),        (4)

where λ_1 ≥ … ≥ λ_d are the eigenvalues of X^⊤X.

Since (4) is monotonic w.r.t. the eigenvalues, we can use a simple binary search to find a rescaling of the matrix for which the expected sample size matches a desired value. Thus, if we rescale X so that the expected sample size is on the order of k/ε, then the above (1 + ε)-type approximation guarantee holds for the DPP sample. We can obtain the same guarantee if we replace the eigenvalues of X^⊤X in (4) with those of a matrix Ã satisfying condition (2), so by Theorem 1 the total cost of obtaining such a sample would be input sparsity time plus a polynomial in d and the sample size. (If we forgo exact DPP sampling, then projection-cost preserving sketches [10] may offer further speed-ups.) This raises the following natural question: can our techniques be extended to sampling from fixed-size DPPs, so that the optimal sample size can be achieved exactly and in time independent of n? We leave this as a new direction for future work.
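The binary search mentioned above is straightforward; the following sketch (our own code, assuming the desired expected size k is strictly below the rank of X) finds a rescaling α such that ∑_i αλ_i/(αλ_i + 1) = k, where λ_i are the eigenvalues of X^⊤X:

    import numpy as np

    def find_rescaling(X, k, tol=1e-6):
        """Find alpha so that DPP(sqrt(alpha) * X) has expected subset size k (requires k < rank(X))."""
        lam = np.linalg.eigvalsh(X.T @ X)

        def expected_size(a):
            return float(np.sum(a * lam / (a * lam + 1.0)))

        lo, hi = 0.0, 1.0
        while expected_size(hi) < k:              # grow the bracket until it contains the target
            hi *= 2.0
        while hi - lo > tol * hi:                 # bisection: the expected size is monotone in alpha
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if expected_size(mid) < k else (lo, mid)
        return 0.5 * (lo + hi)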

4 Regularized determinantal point processes (R-DPP)

We propose a new family of determinantal sampling distributions which will be used in Sections 5 and 6 to prove Theorem 1. The crucial property of this family is that the determinantal sampling probabilities can be regularized by adding an arbitrary fixed positive semi-definite (p.s.d.) matrix inside of the determinant, while maintaining many of the natural properties of a DPP, such as a simple normalization constant. This is achieved by controlling the size of the sample with a Poisson random variable. In the proofs that follow we will use the shorthand [n] = {1, …, n}.

Definition 1.

Given a matrix X ∈ ℝ^{n×d}, a distribution p = (p_1, …, p_n), a p.s.d. matrix A ∈ ℝ^{d×d} and r > 0, we define R-DPP_p^r(X, A) as a distribution over all index sequences σ = (σ_1, …, σ_k) ∈ [n]^k, k = 0, 1, 2, …, s.t.

    Pr(σ)  =  (e^{−r} r^k / k!) · (∏_{i=1}^{k} p_{σ_i}) · det( (1/r) ∑_{i=1}^{k} (1/p_{σ_i}) x_{σ_i} x_{σ_i}^⊤ + A ) / det(X^⊤X + A).        (5)

Whenever p is uniform, we will simply write R-DPP^r(X, A). Of course, we need to establish that this is in fact a valid distribution, i.e. that it sums to one. We achieve this by showing a new variant of the classical Cauchy-Binet formula (the classical formula is stated in (6) below).

Lemma 2.

Given X ∈ ℝ^{n×d}, a distribution p = (p_1, …, p_n) and a p.s.d. matrix A, if K ~ Poisson(r) and σ_1, σ_2, … are i.i.d. samples from p, then

    E[ det( (1/r) ∑_{i=1}^{K} (1/p_{σ_i}) x_{σ_i} x_{σ_i}^⊤ + A ) ]  =  det( X^⊤X + A ).

Remark 3.

The classical Cauchy-Binet formula states that for a matrix X ∈ ℝ^{n×d}, we have

    det(X^⊤X)  =  ∑_{S : |S| = d} det(X_S)²,        (6)

where the sum ranges over all subsets S ⊆ [n] of size d. Probabilistic extensions of the formula previously appeared in the context of volume sampling in [13, 12], and also much earlier in a different context [35] (in these cases A = 0 and the sequence length is fixed).

Proof.

Previously shown identities for a fixed sequence length do not generalize naturally to the regularized setting unless randomness in the length is introduced. We start the proof by applying the Cauchy-Binet formula (6) to the term under the expectation. Let B be any decomposition of A s.t. A = B^⊤B, where B ∈ ℝ^{d×d}. To apply the Cauchy-Binet formula, we sum over all d-element subsets of the union of the rows of the rescaled matrix X_σ and of B.

Applying the law of total expectation w.r.t. the Poisson variable K, we obtain a series in which each term corresponds to a fixed subset size. Here, the first step follows from the exchangeability of the sequence σ (so that the value of the expectation is the same for any subset of its entries), in the second step we expand the expectation and note that only sequences with unique entries will have a non-zero determinant (hence the switch from sequences to subsets and the corresponding counting term), in the third step we absorb the factors 1/r and 1/p_{σ_i} into the determinant, and the final step is the classical Cauchy-Binet formula. It was crucial that we were able to absorb the subset size term into the Poisson series, which allowed the formula to collapse to a single determinant. This completes the proof because det(X^⊤X + B^⊤B) = det(X^⊤X + A).
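As a sanity check, the identity of Lemma 2 (in the form reconstructed above) can be verified numerically; the sketch below (our own illustration) estimates the left-hand side by Monte Carlo on a small random instance and compares it to det(X^⊤X + A):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, r = 20, 3, 5.0
    X = rng.standard_normal((n, d))
    A = rng.standard_normal((d, d)); A = A @ A.T           # random p.s.d. regularizer
    p = np.sum(X**2, axis=1) + 1.0; p = p / p.sum()        # any fixed distribution over rows

    def one_draw():
        k = rng.poisson(r)
        sigma = rng.choice(n, size=k, p=p)                 # i.i.d. samples from p
        M = (X[sigma].T * (1.0 / (r * p[sigma]))) @ X[sigma] + A
        return np.linalg.det(M)

    estimate = np.mean([one_draw() for _ in range(100_000)])
    print(estimate, np.linalg.det(X.T @ X + A))            # should agree up to Monte Carlo error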

In what sense is the R-DPP a natural extension of a determinantal point process? Naively, we might say that setting the matrix A to an all-zeros matrix would recover the classical distribution; however, this is not the case, because when A = 0 only samples of size d or larger will have non-zero probability. Instead, we can demonstrate a connection to both the DPP and volume sampling distributions in a different way: we show that they can be obtained as the limiting distributions of the R-DPP when the regularization and the sample size parameter converge to zero (in two separate ways), which is remarkable, as the two distributions in most cases produce vastly different samples.

Theorem 4.

For r → 0, ignoring the ordering in the sequences sampled with the R-DPP, we have

    R-DPP^r(√r·X, r·I)  →  DPP(X),        (7)
    R-DPP^r(X, r·I)  →  VS(X),            (8)

where VS(X) denotes size-d volume sampling, i.e. Pr(S) ∝ det(X_S)² over subsets of size d.
Proof.

To show (7) we use a fact which is a simple consequence of Sylvester's determinant theorem, namely that det(r·I_d + X_S^⊤X_S) = r^{d−k}·det(r·I_k + X_S X_S^⊤) for a set S of size k. It follows that in the limit every sequence σ with distinct entries receives probability proportional to det(X_S X_S^⊤), where S should be interpreted as the set obtained by treating σ as an unordered multiset. Next, we prove (8): in this limit the distribution concentrates on sequences of length d with distinct entries, because det(X_S X_S^⊤) = 0 whenever |S| > d or the entries of σ repeat, while shorter sequences lose an additional factor of r.

Theorem 4 suggests that R-DPPs are likely to be of independent interest as an extension of DPPs. However, it does not say how to use them algorithmically. To that end, we make a second observation, which essentially states that DPPs are preserved under subsampling with R-DPPs.

Theorem 5.

For any X ∈ ℝ^{n×d}, r > 0 and distribution p over [n] s.t. p_i > 0 for all i, let X̃ denote the matrix X with its i-th row rescaled by 1/√(r·p_i) for every i. It follows that

    if  σ ~ R-DPP_p^r(X, I)  and  S̃ ~ DPP(X̃_σ),  then  {σ_i : i ∈ S̃} ~ DPP(X).

Proof.

Using the law of total probability, we compute the probability of sampling a set T of size m by summing over all intermediate sequences σ and all positions within σ that map onto T. The cancellation of the two normalization constants follows from Sylvester's theorem, and we use the exchangeability of the sequence σ to observe that for any subset of positions of size m the value of the corresponding sum is the same (a binomial factor counts the number of such subsets, and a factorial factor counts the number of sequences of length m that correspond to the set T).

5 Correctness of the main algorithm

We present the first part of the proof of Theorem 1 by establishing that Algorithm 2 produces a sample from a determinantal point process. In fact, we will prove the following more precise claim:

Lemma 6.

Given X ∈ ℝ^{n×d} and a non-zero p.s.d. matrix Ã, Algorithm 2 returns S ~ DPP(√α·X), where the rescaling α > 0 has a closed form in terms of X and Ã.

Proof.

The general idea of the proof is to show that the main repeat loop is implementing an R-DPP, so that we can invoke Theorem 5. Central to this fact is the choice of the numerical factor appearing in the denominator of the Bernoulli sampling probability in line 5. This factor has to depend on the sequence length, because otherwise the determinantal term will always dominate it for long enough sequences. This means that the factor needs to be carefully chosen so that:

  1. the acceptance probability of line 5 is always bounded by 1,

  2. we have control over how the presence of this factor changes the distribution of the returned sample,

  3. and it is not too large, so that we have a good chance of accepting the sample.

We start by showing that the Bernoulli sampling probability in line 5 is in fact bounded by 1. We will use the following simple inequality (proven in Appendix A):

Lemma 7.

For any , , and non-negative integers s.t. we have

Let X̃ be the matrix where for each i the i-th row of X is rescaled by the corresponding importance weight (the same rescaling as in line 5 of the algorithm). We use the arithmetic-geometric mean inequality for the eigenvalues of the matrix appearing inside the determinant, together with Lemma 7, to conclude that the Bernoulli probability in line 5 is indeed at most 1. Having established the validity of the rejection sampling in Algorithm 2, we now compute the distribution of the sample σ at the point of exiting the repeat loop. Denoting the desired Poisson mean parameter and the normalization constant for the sampling probabilities in line 4 accordingly,