In this section we introduce notation and some basic results.
We deal with finite ground sets, so without loss of generality we may take . Fixed subsets of are then equivalent to multi-indices and noted , with cardinality noted . Random subsets are noted or . Expectation is noted , and is the indicator function, so that e.g., . There are two equivalent viewpoints when dealing with finite random subsets: one is to look at , a subset, as the random variable. Another is to consider binary strings of length , which indicate whether item is included in . We note such strings , and depending on context one or the other viewpoint is more convenient. Matrices are in bold capitals, e.g.,
. The identity matrix is noted. Individual entries in a matrix are noted using capitals: is entry in matrix . Sub-matrices are in bold, with indices, for example is the sub-matrix of with rows indexed by and columns indexed by . So-called “Matlab” notation is used occasionally, so that the submatrix formed by selecting all rows in is noted , and is the submatrix containing the first columns. For simplicity, a single index is used if it is repeated:
. Sub-matrices and sub-vectors formed by excluding elements are noted with a minus sign, e.g., the indexincludes all elements in except index .
1.2 Some lemmas
We will need two well-known lemmas in the course of this work. The first one (Cauchy-Binet) is central to the theory of DPPs, the second is an easy lemma on inclusion probabilities.
The Cauchy-Binet lemma expresses the determinant of a matrix product as a sum of products of determinants:
Lemma 1.1 (Cauchy-Binet).
Let , with a matrix, a matrix. We assume . Then:
where is a multi-index of length . The sum is over all multi-indices , of which there are .
The second lemma is an easy lemma on sums of inclusion probabilities. An inclusion probability is the probability that a certain item (or items) appear in a random set.
Lemma 1.2 (Sums of inclusion probabilities).
Let designate a base set of items, and a random subset of . Let designate a fixed subset of items of cardinality . is called an inclusion probability. We have that: , where the expectation is over the random set . In particular:
if is a set of fixed size , the sum equals .
if , the sum equals
For large sets, As a consequence, the sum of order- inclusion probabilities for a set of fixed size is . We use this fact to properly normalise the total variation distance, see section 2.2.
1.3 Elementary symmetric polynomials
The Elementary Symmetric Polynomials (ESPs) of a matrix play an important role in the theory of -DPPs, and one of our core problems will be to find asymptotic formulas for them. Let denote a positive definite matrix and
its eigenvalues. The-th ESP is a sum of all the products of eigenvalues:
For example, . Interesting special cases include , and . There is a rich theory on ESPs, going back at least to Newton, with interesting modern developments (Mariet and Sra, 2017; Jozsa and Mitchison, 2015). As we explain below, they occur in -DPPs as normalisation constants, and ratios of ESPs appear in inclusion probabilities.
DPPs are defined such as to produce random subsets that are not overly redundant, where the notion of redundancy is defined with respect to a (positive definite) similarity function.
We have a collection of items ordered from 1 to . We associate to each pair of items a similarity score , such that the matrix with entries is positive definite. The matrix is called the L-ensemble of the DPP 222We find it more natural to define DPPs via the L-ensemble, since the more common definition via the marginal kernel does not carry over to fixed-size DPPs..
A Determinantal Point Process is a random subset of with probability mass function given by:
The preference for diverse subsets built into DPPs comes from the fact that if a subset includes items that are too similar, the matrix will have nearly colinear columns, and its determinant will be close to 0.
An interesting aspect of DPPs is how tractable the marginals are. The inclusion probabilities, i.e., the probability that item is in , are given by the so-called “marginal kernel” matrix , where
Specifically, for a DPP, . More generally, inclusion probabilities are given by principal minors of the marginal kernel, e.g., if is a subset of :
A DPP can generate random subsets of any size from 1 to . The expected cardinality of can also be read out from the marginal kernel, specifically:
where the ’s designate the eigenvalues of the L-ensemble .
A -DPP is a DPP conditioned on the size of the sampled set . In other words, the probability mass function stays the same but now the sample space is the set of subsets of of size , and
Contrary to DPPs, -DPPs are insensitive to the overall scaling of the L-ensemble. Since
the probability density (1.7) is invariant to any rescaling by a factor .
An important property of -DPPs, one that unlocks many analytical simplifications, is that -DPPs are a mixture distribution. The mixture involves a diagonal -DPP and a projection -DPP, two objects that are simpler than a generic -DPP.
The mixture property is a consequence of the Cauchy-Binet formula (lemma 1.1). Let denote the spectral decomposition of , with and
the matrix of eigenvectors. Then
where is an integration constant (to be defined later), is a subset of columns of , and the sum is over all such subsets of size . Equation (1.8) shows that the probability mass function has the form of a mixture distribution, where we first choose a set of eigenvalues (with indices ) from a -DPP with diagonal L-ensemble and then choose a set of items from a -DPP with L-ensemble . The latter is a specific kind of DPP, called a “projection DPP” .
The same mixture interpretation holds for DPPs as well. In the case of DPPs, the rule for sampling the set of eigenvalues is simpler. Each eigenvalue is sampled independently and included with probability . Once we have the eigenvalues, we proceed in exactly the same way as above: form a projection kernel, and sample the corresponding projection DPP.
1.5.1 Projection DPPs
A projection DPP is a -DPP whose L-ensemble has the following form:
where has orthonormal columns (i.e., ).
Projection DPPs have a set of properties that make them especially tractable. The most salient is that the marginal kernel equals the L-ensemble, e.g., the inclusion probability of item equals , as shown in the following lemma.
In a projection DPP with L-ensemble , .
See appendix. ∎
This result is proved rigorously in the appendix, but straightforward if one looks at projection DPPs as DPPs taken to a certain limit. Consider a DPP with the following L-matrix, indexed by parameter :
where is a diagonal matrix with entries on the diagonal equal to repeated times, followed by , repeated times, and is a orthonormal matrix. Let . Following the mixture interpretation of DPPs, we see that the probability of picking one of the first eigenvalues equals , which tends to 1, while the probability of picking one of the latter tends to 0. This means that with increasing we end up always picking the same eigenvalues, and hence always sampling the same -DPP, one with kernel . The marginal probabilities are given by the corresponding marginal kernel: where has first entries equal to , and the next equal to . In the large- limit, the marginal kernel thus equals as claimed. The limit is however improper, as some entries in the L-matrix tend to infinity.
To sum up: if the L-ensemble is a projection matrix of rank , then a -DPP is also a DPP. We can even extend this further to all L-ensembles of rank .
Let have rank , with eigendecomposition . Without loss of generality, we assume that is of size and a diagonal matrix of size with non-null diagonal elements. Then a -DPP with L-ensemble is also a projection DPP, with marginal kernel equal to .
has rank , so in the eigendecomposition is , and is a diagonal matrix of size . If is a subset of size , we have
and since the matrices involved are square, we have:
Then , which is the probability mass function of a projection DPP and the result follows. ∎
This result hints at a close kinship between -DPPs and DPPs, and convergence results bear this out.
1.5.2 Inclusion probabilities in -DPPs
Since a -DPP is a mixture of projection-DPPs (eq. 1.8), the first order inclusion probability for item can be expressed as
where , the probability that the -th eigenvector is included in set . Formulas for higher-orders (joint inclusion probabilities) are in section A.2.
Computing the inclusion probabilities for a -DPP thus boils down to computing inclusion probabilities in a diagonal -DPP, and combining them with the eigenvectors of .
1.6 Diagonal DPPs and -DPPs
In the special case of diagonal DPPs and -DPPs, the L-ensemble is a diagonal matrix. A diagonal DPP turns out to be nothing more than a Bernoulli process. If conditioned to be of fixed size , a diagonal -DPP is obtained.
So far we have kept with the usual viewpoint on DPPs, which sees them as random sets. Alternatively, a sample from a discrete DPP can be viewed as a binary string of size , where indicates inclusion of the -th item, and . In this section we prefer the latter viewpoint, because it lightens notation.
In this notation the inclusion probability of item equals the marginal probability of , , and similarly for joint probabilities , etc. is the likelihood of the draw.
1.6.1 Diagonal DPPs
Consider a DPP with diagonal L-ensemble
Following eq. (1.4), is diagonal too, with entries . The fact that the marginal kernel is diagonal implies that , with similar results for higher-order probabilities. We conclude that (viewed as a binary string) a diagonal DPP is a product of independent Bernoulli variables, where each is drawn with probability .
1.6.2 Diagonal -DPPs
Viewed as distributions over binary strings, diagonal DPPs are a product measure, meaning that each is sampled independently. Diagonal -DPPs are not, due to the constraint that . The density of a diagonal -DPP is given by:
The integration constant is given by the ’th elementary symmetric polynomial (ESP)
where is a multi-index of size
. At this stage, it may be hard to see what sort of probability distribution eq. (1.15) defines. Indeed, it is not obvious how to sample from such a distribution, and the algorithm given in Kulesza and Taskar (2012) is not trivial. We return to the issue in section 3.3.1.
Inclusion probabilities can be computed through direct summation.
Computing such quantities in practice is again not completely trivial, although (Kulesza and Taskar, 2012) gives an algorithm. We include a fairly accurate approximation below, and due to numerical instabilities in the exact algorithm, we advocate using the approximation in most cases (Section 4).
2 Asymptotic equivalence of -DPPs and DPPs
Before stating our main results formally, we give an intuitive argument as to why -DPPs and DPPs may resemble one another.
2.1 Some intuition
Readers familiar with statistical physics will know of a class of results known as “equivalence of ensembles” (Touchette, 2015). These results justify formally a mathematical subterfuge, whereby a probability distribution that incorporates a hard constraint (the “micro-canonical ensemble”) can be replaced with a more tractable variant (the “canonical ensemble”), where the hard constraint is turned into a soft constraint. Our result is a variant of this particular scenario.
We rewrite the likelihood of a -DPP as the likelihood of a DPP times a hard constraint:
Deploy now the usual trick of turning the hard constraint into a soft constraint via an exponential, defining a new distribution:
where should be set so that on average over , i.e., . Before we find such a value, it helps to recognise that actually has the form of a DPP: since , we have
and we identify as a DPP with L-ensemble . Using eq. (1.6), we find that:
The appropriate value for is determined by the implicit equation that . In terms of the eigenvalues, this reads:
To sum up, this development suggests that a -DPP with ensemble can be approximated by a (tilted) DPP with L-ensemble , with set so that the matched DPP has elements on average. The next section gives a rigorous statement for this approximation.
2.2 Main result
Under certain conditions, DPPs and -DPPs are equivalent in a regime where we pick a fixed ratio of items from a growing set, i.e., , fixed as . By equivalence, we mean that they have the same marginals (inclusion probabilities of order 1 and above). The conditions for equivalence boil down to the number of degrees of freedom of
number of degrees of freedom ofbeing high enough, and we make that condition more precise below. In practice the approximations we derive give excellent results in most settings we have tried, except with very small values of (less than 10, say).
We require assumptions on the L-ensembles: let denote a sequence of positive definite matrices of increasing size . The assumption is that diverges. The question of which sequences of matrices verify this condition is left to section 2.3.
We associate with each a -DPP , where , a fixed fraction of the number of items. Similarly, we have a second sequence of matched DPPs with L-ensemble , where verifies eq. (2.4). Let denote a multi-index of fixed finite size , and the probability that , and the corresponding probability for . We may interpret and as two measures over , and an appropriate means of comparing these quantities is via total variation. Because and have total mass that grows with (see lemma 1.2), we normalise the total variation distance with the appropriate factor.
Let , designate two inclusion measures of order , corresponding to inclusion probabilities in point processes with elements. We define their total variation distance as:
We have the following result:
Under the assumptions above, joint inclusion probabilities under a -DPP and its matched DPP converge:
Note that in our proof we have , which is needed because of a Central Limit argument implicit in the saddlepoint expansion.
A quantity of interest in many calculations are sample averages of the form . Then . An easy corollary is that , from well-known properties of the total variation distance (DasGupta, 2008).
The overall proof path for theorem 2.1 is as follows:
2.2.1 Reduction to diagonal DPPs
Recall (section 1.5) that DPPs and -DPPs are both mixture distributions, where we first draw a set of eigenvectors of , and then draw from a projection DPP formed from these eigenvectors. That second step is the same in DPPs and -DPPs, only the first step differs. In DPPs, we draw from a diagonal DPP, while in -DPPs we draw from a diagonal
-DPP. Heuristically, because it is only the first step that differs, we can focus on our asymptotic study on the first step.
Formally if we can establish that the inclusion probabilities in diagonal -DPPs and DPPs converge (at any finite order), then the inclusion probabilities in general -DPPs and DPPs converge as well (up to the same order). We note and the diagonal DPPs associated with and . The order- inclusion measures for and are noted and , while the corresponding measures for and are noted and (the latter correspond to the probability that certain eigenvectors are included, as per the mixture interpretation of DPPs introduced in section 1.5.1).
The following lemma states the result:
Lemma 2.1 implies that if diagonal -DPPs converge to matched diagonal DPPs, so do general k-DPPs. The proof is deferred to the appendix (section A.2). Armed with this lemma, we now focus only on the diagonal case.
Our goal is now to compute inclusion probabilities in diagonal -DPPs. Recall that denotes a subset of of fixed size . We wish to compute , or equivalently, the probability that . This marginal probability can be computed via direct summation:
Thus, inclusion probabilities in diagonal DPPs can be expressed using ratios of ESPs. This leads us to our next section, where we derive an asymptotic approximation for ESPs. We will then insert the asymptotic approximation into eq. (2.7), to get an asymptotic series for inclusion probabilities.
2.2.2 Saddlepoint approximation for ESPs
ESPs are unwieldy combinatorial objects, but fortunately they lend themselves well to asymptotic approximation. This section is crucial for the rest and so we keep the details in the main text.
ESPs have an elegant probabilistic interpretation (already noted in passing in (Chen, Dempster and Liu, 1994)). An equivalent definition for ESPs views them as the coefficients in a power series:
We borrow the notation from combinatorics to denote the coefficient of in the series . To uncover the probabilistic interpretation of ESPs, we transform the series into a probability generating function.
where is now to be interpreted as the parameter of a Bernoulli variable, . Let designate the sum of all such independent ’s. Then:
Since is the sum of independent random variables, it invites a central limit approximation to the . First, note that:
which tells us that , taken as a function of , is likely to peak near
. The second moment,
gives a measure of scale for the peak of around . Since , we have:
In studying the convergence of -DPPs and DPPs, it is , rather than
that captures the appropriate notion of “degrees of freedom”. In our case the Lyapunov Central Limit Theorem(Billingsley, 2008) requires that diverge asymptotically, and the condition we assumed on the sequence of L-ensembles guarantees exactly that (see section 2.3 for a discussion).
A much better approximation than the Gaussian CLT is the saddlepoint approximation of (Daniels, 1954). Unlike the CLT, it is accurate in the tails and has relative error. It reads:
where is the cumulant-generating function of , and is the solution of the saddlepoint equation:
In our case, we have:
We will need the derivatives of as well:
In large the exponential term dominates (a large deviation regime, see Touchette (2015)), and we have:
At this stage, we have a tractable approximation to ESPs, and we are now ready to use it to find an approximation for inclusion probabilities.
2.2.3 Inclusion probabilities, and ratios of ESPs
To study the asymptotics of inclusion probabilities, we insert approximation (2.14) into eq. (2.7), and compute the and terms. The calculation is lengthy and can be found in the appendix (section A.3). The end result is as follows:
In a diagonal k-DPP with L-ensemble , inclusion probabilities have the asymptotic form:
The terms appearing in the correction are defined in appendix A.3.
Notice that the term corresponds exactly to the inclusion probability in the matched diagonal DPP, . We now have all the elements we need to prove Theorem 2.1. Consider a -DPP with -th order inclusion probability . Let be the -th order inclusion probability of the matched DPP. Let the corresponding measure for the generating diagonal -DPP be , whose approximation is given by eq. (2.21). Starting with Lemma 2.1 and using the approximation leads to
where equality is due to equation (2.4) which implicitly defines , and equality holds because . This concludes the proof of the main result. A refinement is described in Appendix A.4, where we derive a tractable correction to multivariate inclusion probabilities.
A remark on the precise nature of the convergence result is in order. Regardless of how large is, a -DPP will continue to produce sets of fixed size, while a DPP will continue to produce sets of variable size. This implies that DPPs and -DPPs cannot be equivalent in the very strong sense of the respective probability mass functions agreeing on every possible set, since by definition they remain different. The result is of the same nature as equivalence of ensembles in statistical physics: it pertains to two different distributions that agree more and more as tends to infinity, but never agree completely. Practically speaking, an interpretation is that for a given , a -DPP and a matched DPP will have very similar moments up to a certain order: certainly, at order , this cannot be true, since the inclusion measure for the -DPP is uniformly zero, but that is not true for the DPP. To get agreement up to higher orders, one has to increase .
Besides the main result, another consequence of lemma 2.3 is that in importance sampling estimators of the form given by eq. (0.1) can be used with approximate rather than exact probabilities. Using the approximation induces order bias, and similarly using the correction induces order bias. Our recommendation is therefore that one samples -DPPs, rather than DPPs, while using the approximate inclusion probabilities in computations.
2.3 To which sequences of matrices does this apply?
We stated earlier that the result applies to any sequence of matrices whose degrees of freedom grow as a function of , with the more precise statement being that (see eq. (2.12)) should diverge. With the caveat that the condition is sufficient and not necessary, in what sort of scenarios can we expect it to hold?
A full discussion of the issue would require significant forays into random matrix theory and take us beyond the scope of the current work, so we only give a sufficient condition that is relatively easily checked. As mentioned in section 1.5, in -DPPs, the L-ensemble can be multiplied by an arbitrary positive constant without changing the distribution. This means that we are free to scale each by an arbitrary constant independently for each , a normalisation that lets us for instance set to 1 for all . For , , which implies that , and a sufficient condition for the theorem to apply is therefore that diverges.
To pick a practical scenario, consider “in-fill” asymptotics. We suppose that the original set of data is made up of vectors in sampled i.i.d. from a density . The L-ensemble used is the classical squared-exponential (Gaussian) kernel. Let , where . , and from the Gershgorin circle theorem we have a bound on that reads . A sufficient condition for convergence is then that diverges, which will not be the case for fixed . The reason is that essentially counts the number of points in a neighbourhood of size around , and that quantity is . To make diverge, we need to shrink with so that each point has neighbours. Similarly, the condition holds under so-called “increasing-domain” asymptotics, in which is fixed but we consider points in a growing window. It is likely that one could relax these criteria, but in any case we must emphasise that (a) the approximations work really well in practice, see section 4 and (b) actual simulations of -DPPs require L-ensembles that have effective rank quite a bit larger than , otherwise the numerical difficulties are overwhelming even though the process may be well defined.
2.4 Consequences for inference
DPPs are not only used for sampling, but also as statistical models for certain types of data that exhibit repulsion. Now in this case as well the modeller has to make a choice, and use either -DPPs or DPPs. The former seems to imply that the number of observations (which is the role played here by
) is known in advance, while the latter does not. Interestingly, the results above imply that the choice of fixed size or varying size is of no consequence, at least if maximum likelihood is used for inference, though we suspect that Bayesian inference would be the same in that regard. To be precise, what we have in mind here is a case in which we observe a setof size , assumed to have been drawn from a -DPP of matrix , where is a vector of parameters. For instance, may control the amount of repulsion in the point process. The log-likelihood of a -DPP is given by:
The corresponding Maximum Likelihood estimator of is noted:
Similarly, a DPP model would assume to be drawn from a DPP with L-ensemble , where controls the expected cardinality of the set. The log-likelihood reads in this case:
Since is effectively a nuisance parameter, we may use a profile likelihood:
The ML estimator of in this case solves:
To find a closed-form for the profile likelihood (eq. (2.25), we take the derivative of with respect to , to find:
where we recognise the saddlepoint equation in yet another form.
Equating the above to 0, we obtain:
From eq. (2.19) we know that:
which tells us that:
where the term in comes from the second derivative of in (2.19) and is expected to be small compared to . A full formal argument showing convergence of to is complicated, and amounts to showing that the term is constant in a relevant region around . Informally, however, what happens is quite clear: the two cost functions are close (up to a vertical shift), and if they are sufficiently well-behaved (as a function of ), then we expect . We verify this conjecture in a numerical example in section 4.3.
3 Algorithms and numerical results
The results above are interesting theoretically, but can also be used in practice to develop algorithms that compute (approximate) ESPs, sample diagonal -DPPs, and compute inclusion probabilities. We find empirically that although approximate, they are much better behaved numerically than their nominally exact counterpart.
3.1 Computing ESPs
The algorithm given in Kulesza and Taskar (2012) (alg. 7, p. 60) for computing ESPs of all orders is fast 333Their algorithm runs in , like ours, although theirs is faster in practice. but prone to numerical problems when is large, which is not completely surprising given that ESPs can vary over dozens of orders of magnitude. We find that the saddlepoint approximation given in eq. (2.19) is more practical, especially since it is naturally computed on a logarithmic scale, a perk exact algorithms do not share. To compute eq. (2.19), one needs to solve the saddlepoint equation (eq. (2.4)) for . Newton’s algorithm can be used (alg. 1), but it needs appropriate initialisation or it may not converge (when it does converge, it does so very fast). In our implementation, an initial guess for is found by linearising the saddlepoint equation for small (small ), and large (large ). In small , we have:
so that for small, we may approximate as:
In large we find:
which solving for results in:
We use the first guess for and the second otherwise, with good results. Interestingly, (3.1) and (3.2) can be used to find the worst-case relative error of the saddlepoint approximation, which is about 10%, a figure we verify in practice for all but the smallest . The saddlepoint approximation is at its worst far out in the tails, that is, for and . Recall that . We inject (3.1) into (2.19) and linearise to find:
so that the relative error is about . A similar calculation for yields the same figure.
To compute all ESPs, it is useful to begin at and then “warm-start” the optimisation rather than always use the same initial condition. The procedure is outlined in algorithm 2.