, and have since then received much attention, as they arise in many fields, e.g. random matrix theory, combinatorics, quantum physics. We refer the reader to[3, 4, 5] for detailed tutorial reviews, respectively aimed at audiences of machine learners, statisticians, and probabilists. More recently, DPPs have been considered as a modelling tool, see e.g. [4, 3, 6]: DPPs appear to be a natural alternative to Poisson processes when realizations should exhibit repulsiveness. In , for example, DPPs are used to model diversity among summary timelines in a large news corpus. In , DPPs model diversity among the results of a search engine for a given query. In , DPPs model the spatial repartition of trees in a forest, as similar trees compete for nutrients in the ground, and thus tend to grow away from each other. With these modelling applications comes the question of learning a DPP from data, either through a parametrized form [4, 7], or non-parametrically [8, 9]. We focus in this paper on parametric inference.
Similarly to the correlation between the function values in a Gaussian process , the repulsiveness in a DPP is defined through a kernel , which measures how much two points in a realization repel each other. The likelihood of a DPP involves the evaluation and the spectral decomposition of an operator defined through a kernel that is related to . There are two main issues that arise when performing likelihood-based inference for a DPP. First, the likelihood involves evaluating the kernel , while it is more natural to parametrize instead, and there is no easy link between the parameters of these two kernels. The second issue is that the spectral decomposition of the operator required in the likelihood evaluation is rarely available in practice, for computational or analytical reasons. For example, in the case of a large finite set of objects, as in the news corpus application , evaluating the likelihood once requires the eigendecomposition of a large matrix. Similarly, in the case of a continuous domain, as for the forest application , the spectral decomposition of the operator may not be analytically tractable for nontrivial choices of kernel . In this paper, we focus on the second issue, i.e., we provide likelihood-based inference methods that assume the kernel is parametrized, but that do not require any eigendecomposition, unlike . More specifically, our main contribution is to provide bounds on the likelihood of a DPP that do not depend on the spectral decomposition of the operator . For the finite case, we draw inspiration from bounds used for variational inference of Gaussian processes , and we extend these bounds to DPPs over continuous domains.
For ease of presentation, we first consider DPPs over finite sets of objects in Section 2, and we derive bounds on the likelihood. In Section 3, we plug these bounds into known inference paradigms: variational inference and Markov chain Monte Carlo inference. In Section 4, we extend our results to the case of a DPP over a continuous domain. Readers who are only interested in the finite case, or who are unfamiliar with operator theory, can safely skip Section 4 without missing our main points. In Section 5, we experimentally validate our results, before discussing their breadth in Section 6.
2 DPPs over finite sets
2.1 Definition and likelihood
Consider a discrete set of items , where
is a vector of attributes that describes item. Let be a symmetric positive definite kernel  on , and let be the Gram matrix of . The DPP of kernel
is defined as the probability distribution over all possiblesubsets such that
where denotes the sub-matrix of indexed by the elements of
. This distribution exists and is unique if and only if the eigenvalues ofare in . Intuitively, we can think of as encoding the amount of negative correlation, or “repulsiveness” between and . Indeed, as remarked in , (1) first yields that diagonal elements of are marginal probabilities: . Equation (1) then entails that and are likely to co-occur in a realization of if and only if
is large: off-diagonal terms in indicate whether points tend to co-occur.
Providing the eigenvalues of are further restricted to be in , the DPP of kernel has a likelihood . More specifically, writing for a realization of ,
where , is the identity matrix, and denotes the sub-matrix of indexed by the elements of . Now, given a realization , we would like to infer the parameters of kernel , say the parameters of a squared exponential kernel 
Since the trace of is the expected number of points in 
, one can estimateby the number of points in the data divided by . But , the parameter governing the repulsiveness, has to be fitted. If the number of items is large, likelihood-based methods such as maximum likelihood are too costly: each evaluation of (2) requires storage and time. Furthermore, valid choices of are constrained, since one needs to make sure the eigenvalues of remain in .
A partial work-around is to note that given any symmetric positive definite kernel , the likelihood (2) with matrix corresponds to a valid choice of , since the corresponding matrix necessarily has eigenvalues in , which makes sure the DPP exists . The work-around consists in directly parametrizing and inferring the kernel instead of , so that the numerator of (2) is cheap to evaluate, and parameters are less constrained. Note that this step favours tractability over interpretability of the inferred parameters: if we assume
the number of points and the repulsiveness of the points in do not decouple as nicely as when is squared exponential. For example, the expected number of items in depends on and now, and both parameters also affect repulsiveness. There is some work investigating approximations to to retain the more interpretable parametrization , but the machine learning literature [3, 7] almost exclusively adopts the more tractable parametrization of . In this paper, we also make this choice of parametrizing directly.
Now, the only expensive step in the evaluation of (2) is the computation of . While this still prevents the application of maximum likelihood, bounds on this determinant can be used in a variational approach or an MCMC algorithm, for example. In , bounds on are proposed, requiring only the first eigenvalues of , where is chosen adaptively at each MCMC iteration to make the acceptance decision possible. This still requires the application of power iteration methods, which are limited to the finite domain case, require storing the whole matrix , and are prohibitively slow when the number of required eigenvalues is large.
2.2 Nonspectral bounds on the likelihood
Let us denote by the submatrix of where row indices correspond to the elements of , and column indices to those of . When , we simply write for , and we drop the subscript when . Drawing inspiration from sparse approximations to Gaussian processes using inducing variables , we let be an arbitrary set of points in , and we approximate by . Note that we do not constrain to belong to , so that our bounds do not rely on a Nyström-type approximation . We term “pseudo-inputs”, or “inducing inputs”.
The right inequality is a straightforward consequence of the Schur complement being positive semidefinite. For instance, once could remark that for ,
A change of variables using the Cholesky decompositions of and yields the desired inequality.
The left inequality in (3) can be proved along the lines of , using variational arguments (as will be discussed in detail in Section 3.1.1). Next we give an alternative, more direct proof based on an inequality on determinants [14, Theorem 1]. For any real symmetric matrix , define its absolute value as . In particular, for a positive semidefinite , . Applying [14, Theorem 1] and noting that , and are positive semidefinite, it comes
Now, denote by the eigenvalues of , which are all nonnegative. It comes
3 Learning a DPP using bounds
In this section, we explain how to run variational inference and Markov chain Monte Carlo methods using the bounds in Proposition 1. In this section, we also make connections with variational sparse Gaussian processes more explicit.
3.1 Variational inference
Let be an arbitrary set of points in . Proposition 1 then yields a lower bound
The lower bound can be computed efficiently in time. Instead of maximizing , we can maximize jointly with respect to the kernel parameters and the variational parameters .
To maximize (7), one can e.g. implement an EM-like scheme, alternately optimizing in and . Kernels are often differentiable with respect to , and sometimes will also be differentiable with respect to the pseudo-inputs , so that gradient-based methods can help. In the general case, black-box optimizers such as CMA-ES , can also be employed.
3.1.1 Connections with variational inference in sparse GPs
We now discuss the connection between the variational lower bound in Proposition 1 and the variational lower bound used in sparse Gaussian process (GP; ) models . Although we do not explore this in the current paper, this connection could extend the repertoire of variational inference algorithms for DPPs by including, for instance, stochastic optimization variants.
Assume function follows a GP distribution with zero mean function and kernel function , so that the vector of function values evaluated at
follows the Gaussian distribution. Then, the standard Gaussian integral yields111Notice that , and (8) follows.
Following this interpretation, we augment the vector with a vector of extra auxiliary function values , referred to as inducing variables, evaluated at the inducing points so that jointly follows
Now by using the fact that , the integral in (8) can be expanded so that
We can bound the above integral using Jensen’s inequality and the variational distribution , where is a marginal variational distribution over the inducing variables . This form of variational distribution is exactly the one used for sparse GPs , and by treating the factor optimally we can recover the left lower bound in Proposition 1, following the lines of . We provide details in Appendix A.
The above connection suggests that much of the technology developed for speeding up GPs can be transferred to DPPs. For instance, if we explicitly represent the variational distribution in the above formulation, then we can develop stochastic variational inference variants for learning DPPs based on data subsampling . In other words, we can apply to DPPs stochastic variational inference algorithms for sparse GPs such as .
3.2 Markov chain Monte Carlo inference
, Bayesian inference relies on the posterior distribution, where the log likelihood is defined in (6). A standard approach to sample approximately from is the Metropolis-Hastings algorithm (MH; [18, Chapter 7.3]). MH consists in building an ergodic Markov chain of invariant distribution . Given a proposal , the MH algorithm starts its chain at a user-defined , then at iteration it proposes a candidate state and sets to with probability
while is otherwise set to . The core of the algorithm is thus to draw a Bernoulli variable with parameter for . This is typically implemented by drawing a uniform and checking whether . In our DPP application, we cannot evaluate . But we can use Proposition 1 to build a lower and an upper bound , which can be arbitrarily refined by increasing the cardinality of and optimizing over . We can thus build a lower and upper bound for
Now, another way to draw a Bernoulli variable with parameter is to first draw , and then refine the bounds in (12), by augmenting the numbers , of inducing variables and optimizing over , until is out of the interval formed by the bounds in (12). Then one can decide whether . This Bernoulli trick is sometimes named retrospective sampling and has been suggested as early as . It has been used within MH for inference on DPPs with spectral bounds in , we simply adapt it to our non-spectral bounds.
4 The case of continuous DPPs
DPPs can be defined over very general spaces . We limit ourselves here to point processes on such that one can extend the notion of likelihood. In particular, we define here a DPP on as in [1, Example 5.4(c)], by defining its Janossy density. For definitions of traces and determinants of operators, we follow [20, Section VII].
Let be a measure on that is continuous with respect to the Lebesgue measure, with density . Let be a symmetric positive definite kernel. defines a self-adjoint operator on through Assume is trace-class, and
We assume (13) to avoid technicalities. Proving (13) can be done by requiring various assumptions on and . Under the assumptions of Mercer’s theorem, for instance, (13) will be satisfied [20, Section VII, Theorem 2.3]. More generally, the assumptions of [21, Theorem 2.12] apply to kernels over noncompact domains, in particular the Gaussian kernel with Gaussian base measure that is often used in practice. We denote by the eigenvalues of the compact operator . There exists [1, Example 5.4(c)] a simple222i.e., for which all points in a realization are distinct. point process on such that
where is the open ball of center and radius , and where is the Fredholm determinant of operator [20, Section VII]. Such a process is called the determinantal point process associated to kernel and base measure .333There is a notion of kernel for general DPPs , but we define directly here, for the sake of simplicity. The interpretability issues of using instead of are the same as for the finite case, see Sections 2 and 5. Equation (14) is the continuous equivalent of (2). Our bounds require to be computable. This is the case for the popular Gaussian kernel with Gaussian base measure.
4.2 Nonspectral bounds on the likelihood
In this section, we derive bounds on the likelihood (14) that do not require to compute the Fredholm determinant .
Let , then
With the notation of Section 4, it holds
First note that has finite rank since for ,
Note also that the ’s are linearly independent since is a positive definite kernel. Now let be an orthonormal basis of , i.e. and
and define the matrix by
where denotes the inner product of . By definition of the Fredholm determinant for finite rank operators [20, Section VII.1 or Theorem VII.3.2], it comes
Applying the classical matrix determinant lemma, it comes
We finally remark that
We now apply the same argument as in the proof of the finite case (proof of Proposition 1). Denoting and , we know from the positive definiteness of the kernel that is positive semidefinite, which yields
5.1 A toy Gaussian continuous experiment
In this section, we consider a DPP on , so that the bounds derived in Section 4 apply. As in [7, Section 5.1], we take the base measure to be proportional to a Gaussian, i.e. its density is We consider a squared exponential kernel . In this particular case, the spectral decomposition of operator is known 444We follow the parametrization of  for ease of reference.
: the eigenfunctions ofare scaled Hermite polynomials, while the eigenvalues are a geometrically decreasing sequence. This 1D Gaussian-Gaussian example is interesting for two reasons: first, the spectral decomposition of is known, so that we can sample exactly from the corresponding DPP  and thus generate synthetic datasets. Second, the Fredholm determinant in this special case is a q-Pochhammer symbol, and can thus be efficiently computed555http://docs.sympy.org/latest/modules/mpmath/functions/qfunctions.html#q-pochhammer-symbol, which allows for comparison with “ideal” likelihood-based methods, to check the validity of our MCMC sampler, for instance. We emphasize that these special properties are not needed for the inference methods in Section 3, they are simply useful to demonstrate their correctness.
We sample a synthetic dataset using , resulting in points shown in red in Figure 1. Applying the variational inference method of Section 3.1, jointly optimizing in and using the CMA-ES optimizer , yields poorly consistent results: varies over several orders of magnitude from one run to the other, and relative errors for and go up to (not shown). We thus investigate the identifiability of the parameters with the retrospective MH of Section 3.2. To limit the range of , we choose for a wide uniform prior over
We use a Gaussian proposal, the covariance matrix of which is adapted on-the-fly  so as to reach of acceptance. We start each iteration with pseudo-inputs, and increase it by and re-optimize when the acceptance decision cannot be made. Most iterations could be made with , and the maximum number of inducing inputs required in our run was . We show the results of a run of length in Figure 1. Removing a burn-in sample of size , we show the resulting marginal histograms in Figures 1, 1, and 1. Retrospective MH and the ideal MH agree. The prior pdf is in green. The posterior marginals of and are centered around the values used for simulation, and are very different from the prior, showing that the likelihood contains information about and . However, as expected, almost nothing is learnt about , as posterior and prior roughly coincide. This is an example of the issues that come with parametrizing directly, as mentioned in Section 1.
To conclude, we show a set of optimized pseudo-inputs in black in Figure 1. We also superimpose the marginal of any single point in the realization, which is available through the spectral decomposition of here . In this particular case, this marginal is a Gaussian. Interestingly, the pseudo-inputs look like evenly spread samples from this marginal. Intuitively, they are likely to make the denominator in the likelihood (14) small, as they represent an ideal sample of the Gaussian-Gaussian DPP.
5.2 Diabetic neuropathy dataset
Here, we consider a real dataset of spatial patterns of nerve fibers in diabetic patients. These nerve fibers become more clustered as diabetes progresses . The dataset consists of seven samples collected from diabetic patients at different stages of diabetic neuropathy and one healthy subject. We follow the experimental setup used in  and we split the total samples into two classes: Normal/Mildly Diabetic and Moderately/Severely Diabetic. The first class contains three samples and the second one the remaining four. Figure 2 displays the point process data, which contain on average points per sample in the Normal/Mildly class and for the Moderately/Severely class. We wish to investigate the differences between these classes by fitting a separate DPP to each class and then quantify the differences of the repulsion or overdispersion of the point process data through the inferred kernel parameters. Paraphrasing , we consider a continuous DPP on , with kernel function
and base measure proportional to a Gaussian As in , we quantify the overdispersion of realizations of such a Gaussian-Gaussian DPP through the quantities , which are invariant to the scaling of . Note however that, strictly speaking, also mildly influences repulsion.
We investigate the ability of the variational method in Section 3.1 to perform approximate maximum likelihood training over the kernel parameters . Specifically, we wish to fit a separate continuous DPP to each class by jointly maximizing the variational lower bound over and the inducing inputs using gradient-based optimization. Given that the number of inducing variables determines the amount of the approximation, or compression of the DPP model, we examine different settings for this number and see whether the corresponding trained models provide similar estimates for the overdispersion measures. Thus, we train the DPPs under different approximations having inducing variables and display the estimated overdispersion measures in Figures 3 and 3. These estimated measures converge to coherent values as increases. They show a clear separation between the two classes, as also found in [7, 24]. Furthermore, Figures 3 and 3 show the values of the upper and lower bounds on the log likelihood, which as expected, converge to the same limit as increases. We point out that the overall optimization of the variational lower is relatively fast in our MATLAB implementation. For instance, it takes 24 minutes for the most expensive run where to perform iterations until convergence. Smaller values of yield significantly smaller times.
Finally, as in Section 5.1, we comment on the optimized pseudo-inputs. Figure 4 displays the inducing points at the end of a converged run of variational inference for various values of . Similarly to Figure 1, these pseudo-inputs are placed in remarkably neat grids and depart significantly from their initial locations.
We have proposed novel, cheap-to-evaluate, nonspectral bounds on the determinants arising in the likelihoods of DPPs, both finite and continuous. We have shown how to use these bounds to infer the parameters of a DPP, and demonstrated their use for expensive-but-exact MCMC and cheap-but-approximate variational inference. In particular, these bounds have some degree of freedom – the pseudo-inputs –, which we optimize so as to tighten the bounds. This optimization step is crucial for likelihood-based inference of parametric DPP models, where bounds have to adapt to the point where the likelihood is evaluated to yield decisions which are consistent with the ideal underlying algorithms. In future work, we plan to investigate connections of our bounds with the quadrature-based bounds for Fredholm determinants of. We also plan to consider variants of DPPs that condition on the number of points in the realization, to put joint priors over the within-class distributions of the features in classification problems, in a manner related to . In the long term, we will investigate connections between kernels and that could be made without spectral knowledge, to address the issue of replacing by .
The authors would like to thank Adrien Hardy for useful discussions and Emily Fox for kindly providing access to the diabetic neuropathy dataset.
Appendix A On the connection to variational sparse GPs
Here, we provide further details about how the variational lower bound for DPPs over finite sets in Proposition 1 can be obtained by the variational approach to sparse GPs . As mentioned in Section 3.1.1, it holds that
Taking logarithms yields
A lower bound to the likelihood (2) can thus be obtained if we bound
This has a similar functional form with the marginal likelihood in a standard GP regression model:
plays the role of an unnormalized Gaussian likelihood where the observation vector is equal to zero and the noise variance is equal to one. To lower bound the above we can consider the variational distributionand apply Jensen’s inequality so that
where the term cancels out inside the logarithm. This can be written as
Further, given that
where , the bound can be written as
Now if we analytically maximize w.r.t. , under the constraint that is a distribution, we obtain
Plugging this optimal back into the bound, we obtain
After computing the Gaussian integral w.r.t. , the r.h.s. reduces to the logarithm of the DPP bound for the finite case, see Proposition 1.
Appendix B matrix for Gaussian kernels
with Gaussian base measure having density
In this Gaussian-Gaussian case, the matrix defined in Proposition 2 can be analytically computed: the -th element is given by
-  D. J. Daley and D. Vere-Jones. An Introduction to the Theory of Point Processes. Springer, 2nd edition, 2003.
-  O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, 7:83–122, 1975.
-  A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 2012.
-  F. Lavancier, J. Møller, and E. Rubak. Determinantal point process models and statistical inference. Preprint, available as http://arxiv.org/abs/1205.4818, 2012.
-  J. B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probability surveys, 2006.
-  J. Y. Zou and R. P. Adams. Priors for diversity in generative latent variable models. In Advances in Neural Information Processing Systems (NIPS), 2012.
-  R. H. Affandi, E. B. Fox, R. P. Adams, and B. Taskar. Learning the parameters of determinantal point processes. In Proceedings of the International Conference on Machine Learning (ICML), 2014.
-  J. Gillenwater, A. Kulesza, E. B. Fox, and B. Taskar. Expectation-maximization for learning determinantal point processes. In Advances in Neural Information Proccessing Systems (NIPS), 2014.
-  Z. Mariet and S. Sra. Fixed-point algorithms for learning determinantal point processes. In Advances in Neural Information systems (NIPS), 2015.
-  C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
-  Michalis K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In AISTATS, volume 5, 2009.
N. Cristianini and J. Shawe-Taylor.
Kernel methods for pattern recognition. Cambridge University Press, 2004.
R. H. Affandi, A. Kulesza, E. B. Fox, and B. Taskar.
Nyström approximation for large-scale determinantal processes.
Proceedings of the conference on Artificial Intelligence and Statistics (AISTATS), 2013.
-  E. Seiler and B. Simon. An inequality among determinants. Proceedings of the National Academy of Sciences, 1975.
The CMA evolution strategy: a comparing review.
In J.A. Lozano, P. Larranaga, I. Inza, and E. Bengoetxea, editors,
Towards a new evolutionary computation. Advances on estimation of distribution algorithms, pages 75–102. Springer, 2006.
-  Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. J. Mach. Learn. Res., 14(1):1303–1347, May 2013.
-  James Hensman, Nicoló Fusi, and Neil D. Lawrence. Gaussian processes for big data. In UAI, 2013.
-  C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer-Verlag, New York, 2004.
-  L. Devroye. Non-uniform random variate generation. Springer-Verlag, 1986.
-  I. Gohberg, S. Goldberg, and M. A. Kaashoek. Classes of linear operators, Volume I. Springer, 1990.
-  B. Simon. Trace ideals and their applications. American Mathematical Society, 2nd edition, 2005.
-  G. E. Fasshauer and M. J. McCourt. SIAM Journal on Scientific Computing, 34(2), 2012.
-  H. Haario, E. Saksman, and J. Tamminen. An adaptive Metropolis algorithm. Bernoulli, 7:223–242, 2001.
-  L. A. Waller, A. Särkkä, V. Olsbo, M. Myllymäki, I. G. Panoutsopoulou, W. R. Kennedy, and G. Wendelschafer-Crabb. Second-order spatial analysis of epidermal nerve fibers. Statistics in Medicine, 30(23):2827–2841, 2011.
-  F. Bornemann. On the numerical evaluation of Fredholm determinants. Mathematics of Computation, 79(270):871–915, 2010.