1 Introduction
Samples from a determinantal point process (DPP) [15] are sets of points that tend to be spread out. More specifically, given and a positive semidefinite kernel function
, the probability density of a point configuration
under a DPP with kernel is given by(1) 
where is the matrix with entries for each . The tendency for repulsion is captured by the determinant since it depends on the volume spanned by the selected points in the associated Hilbert space of . Intuitively, points similar according to or points that are nearly linearly dependent are less likely to be selected.
Building on the foundational work in [5] for the case where is discrete and finite, DPPs have been used in machine learning as a model for subset selection in which diverse sets are preferred [13, 2, 3, 9, 12]. These methods build on the tractability of sampling based on the algorithm of Hough et al. [10]
, which relies on the eigendecomposition of the kernel matrix to recursively sample points based on their projections onto the subspace spanned by the selected eigenvectors.
Repulsive point processes, like hard core processes [16, 7], many based on thinned Poisson processes and Gibbs/Markov distributions, have a long history in the spatial statistics community, where considering continuous is key. Many naturally occurring phenomena exhibit diversity—trees tend to grow in the least occupied space [17], ant hill locations are overdispersed relative to uniform placement [4] and the spatial distribution of nerve fibers is indicative of neuropathy, with hardcore processes providing a critical tool [25]. Repulsive processes on continuous spaces have garnered interest in machine learning as well, especially relating to generative mixture modeling [29, 18].
The computationally attractive properties of DPPs make them appealing to consider in these applications. On the surface, it seems that the eigendecomposition and projection algorithm of [10] for discrete DPPs would naturally extend to the continuous case. While this is true in a formal sense as becomes an operator instead of a matrix, the key steps such as the eigendecomposition of the kernel and projection of points on subspaces spanned by eigenfunctions are computationally infeasible except in a few very limited cases where approximations can be made [14]. The absence of a tractable DPP sampling algorithm for general kernels in continuous spaces has hindered progress in developing DPPbased models for repulsion.
In this paper, we propose an efficient algorithm to sample from DPPs in continuous spaces using lowrank approximations of the kernel function. We investigate two such schemes: Nyström and random Fourier features. Our approach utilizes a dual representation of the DPP, a technique that has proven useful in the discrete setting as well [11]. For DPPs, which only place positive probability on sets of cardinality [13], we also devise a Gibbs sampler that iteratively samples points in the set conditioned on all other points. The derivation relies on representing the conditional DPPs using the Schur complement of the kernel. Our methods allow us to handle a broad range of typical kernels and continuous subspaces, provided certain simple integrals of the kernel function can be computed efficiently. Decomposing our kernel into quality and similarity terms as in [13]
, this includes, but is not limited to, all cases where the (i) spectral density of the quality and (ii) characteristic function of the similarity kernel can be computed efficiently. Our methods scale well with dimension, in particular with complexity growing linearly in
.In Sec. 2, we review sampling algorithms for discrete DPPs and the challenges associated with sampling from continuous DPPs. We then propose continuous DPP sampling algorithms based on lowrank kernel approximations in Sec. 3 and Gibbs sampling in Sec. 4. An empirical analysis of the two schemes is provided in Sec. 5. Finally, we apply our methods to repulsive mixture modeling and human pose synthesis in Sec. 6 and 7.
2 Sampling from a DPP
When is discrete with cardinality , an efficient algorithm for sampling from a DPP is given in [10]. The algorithm, which is detailed in the supplement, uses an eigendecomposition of the kernel matrix and recursively samples points as follows, resulting in a set with :

Select eigenvector with probability . Let be the selected eigenvectors ().

For , sample points sequentially with probability based on the projection of onto the subspace spanned by . Once is sampled, update by excluding the subspace spanned by the projection of onto .
When
is discrete, both steps are straightforward since the first phase involves eigendecomposing a kernel matrix and the second phase involves sampling from discrete probability distributions based on inner products between points and eigenvectors. Extending this algorithm to a continuous space was considered by
[14], but for a very limited set of kernels and spaces . For general and , we face difficulties in both phases. Extending Phase 1 to a continuous space requires knowledge of the eigendecomposition of the kernel function. When is a compact rectangle in , [14] suggest approximating the eigendecomposition using an orthonormal Fourier basis.Even if we are able to obtain the eigendecomposition of the kernel function (either directly or via approximations as considered in [14] and Sec. 3), we still need to implement Phase 2 of the sampling algorithm. Whereas the discrete case only requires sampling from a discrete probability function, here we have to sample from a probability density. When is compact, [14] suggest using a rejection sampler with a uniform proposal on . The authors note that the acceptance rate of this rejection sampler decreases with the number of points sampled, making the method inefficient in sampling large sets from a DPP. In most other cases, implementing Phase 2 even via rejection sampling is infeasible since the target density is in general nonstandard with unknown normalization. Furthermore, a generic proposal distribution can yield extremely low acceptance rates.
In summary, current algorithms can sample approximately from a continuous DPP only for translationinvariant kernels defined on a compact space. In Sec. 3, we propose a sampling algorithm that allows us to sample approximately from DPPs for a wide range of kernels and spaces .
3 Sampling from a lowrank continuous DPP
Again considering discrete with cardinality , the sampling algorithm of Sec. 2 has complexity dominated by the eigendecomposition, . If the kernel matrix is lowrank, i.e. with a matrix and , [11] showed that the complexity of sampling can be reduced to . The basic idea is to exploit the fact that and the dual kernel matrix , which is
, share the same nonzero eigenvalues, and for each eigenvector
of , is the corresponding eigenvector of . See the supplement for algorithmic details.While the dependence on in the dual is sharply reduced, in continuous spaces, is infinite. In order to extend the algorithm, we must find efficient ways to compute for Phase 1 and manipulate eigenfunctions implicitly for the projections in Phase 2. Generically, consider sampling from a DPP on a continuous space with kernel where and are eigenvalues and eigenfunctions, and is the complex conjugate of . Assume that we can approximate by a lowdimensional (generally complexvalued) mapping, :
(2) 
Here, denotes complex conjugate transpose of . We consider two efficient lowrank approximation schemes in Sec. 3.1 and 3.2. Using such a lowrank representation, we propose an analog of the dual sampling algorithm for continuous spaces, described in Algorithm 1. A similar algorithm provides samples from a kDPP, which only gives positive probability to sets of a fixed cardinality [13]. The only change required is to the forloop in Phase 1 to select exactly eigenvectors using an efficient recursion. See the supplement for details.
In this dual view, we still have the same twophase structure, and must address two key challenges:

Assuming a lowrank kernel function decomposition as in Eq. (2), we need to able to compute the dual kernel matrix, given by an integral:
(3) 
In general, sampling directly from the density
is difficult; instead, we can compute the cumulative distribution function (CDF) and sample
using the inverse CDF method [21]:(4)
Assuming (i) the kernel function is finiterank and (ii) the terms and are computable, Algorithm 1 provides exact samples from a DPP with kernel . In what follows, approximations only arise from approximating general kernels with lowrank kernels . If given a finiterank kernel to begin with, the sampling procedure is exact.
One could imagine approximating as in Eq. (2) by simply truncating the eigendecomposition (either directly or using numerical approximations). However, this simple approximation for known decompositions does not necessarily yield a tractable sampler, because the products of eigenfunctions required in Eq. (3) might not be efficiently integrable. For our approximation algorithm to work, not only do we need methods that approximate the kernel function well, but also that enable us to solve Eq. (3) and (4) directly for many different kernel functions. We consider two such approaches that enable an efficient sampler for a wide range of kernels: Nyström and random Fourier features.
3.1 Sampling from RFFapproximated DPP
Random Fourier features (RFF) [19] is an approach for approximating shiftinvariant kernels,
, using randomly selected frequencies. The frequencies are sampled independently from the Fourier transform of the kernel function,
, and letting:(5) 
To apply RFFs, we factor into a quality function and similarity kernel (i.e., :
(6) 
The RFF approximation can be applied to cases where the similarity function has a known characteristic function, e.g., Gaussian, Laplacian and Cauchy. Using Eq. (5), we can approximate the similarity kernel function to obtain a lowrank kernel and dual matrix:
The CDF of the sampling distribution in Algorithm 1 is given by:
(7) 
where denotes the th element of vector . Note that equations and can be computed for many different combinations of and . In fact, this method works for any combination of (i) translationinvariant similarity kernel with known characteristic function and (ii) quality function with known spectral density. The resulting kernel need not be translation invariant. In the supplement, we illustrate this method by considering a common and important example where , is Gaussian, and is any kernel with known Fourier transform.
3.2 Sampling from a Nyströmapproximated DPP
Another approach to kernel approximation is the Nyström method [27]. In particular, given landmarks sampled from , we can approximate the kernel function and dual matrix as,
where . Denoting , the CDF of in Alg. 1 is:
(8) 
As with the RFF case, we consider a decomposition . Here, there are no translationinvariant requirements, even for the similarity kernel . In the supplement, we provide the important example where and both and are Gaussians and also when is polynomial, a case that cannot be handled by RFF since it is not translationally invariant.
4 Gibbs sampling
For DPPs, we can consider a Gibbs sampling scheme. In the supplement, we derive that the full conditional for the inclusion of point given the inclusion of the other points is a DPP with a modified kernel, which we know how to sample from. Let the kernel function be represented as before: . Denoting and the full conditional can be simplified using Schur’s determinantal equality [22]:
(9) 
In general, sampling directly from this full conditional is difficult. However, for a wide range of kernel functions, including those which can be handled by the Nyström approximation in Sec. 3.2, the CDF can be computed analytically and can be sampled using the inverse CDF method:
(10) 
In the supplement, we illustrate this method by considering the case where and and are Gaussians. We use this same Schur complement scheme for sampling from the full conditionals in the mixture model application of Sec. 6. A key advantage of this scheme for several types of kernels is that the complexity of sampling scales linearly with the number of dimensions making it suitable in handling highdimensional spaces.
As with any Gibbs sampling scheme, the mixing rate is dependent on the correlations between variables. In cases where the kernel introduces low repulsion we expect the Gibbs sampler to mix well, while in a high repulsion setting the sampler can mix slowly due to the strong dependencies between points and fact that we are only doing onepointatatime moves. We explore the dependence of convergence on repulsion strength in the supplementary materials. Regardless, this sampler provides a nice tool in the DPP setting. Asymptotically, theory suggests that we get exact (though correlated) samples from the DPP. To extend this approach to standard DPPs, we can first sample (this assumes knowledge of the eigenvalues of ) and then apply the above method to get a sample. This is fairly inefficient if many samples are needed. A more involved but potentially efficient approach is to consider a birthdeath sampling scheme where the size of the set can grow/shrink by 1 at every step.
5 Empirical analysis
To evaluate the performance of the RFF and Nyström approximations, we compute the total variational distance , where denotes the probability of set under a DPP with kernel , as given by Eq. (1). We restrict our analysis to the case where the quality function and similarity kernel are Gaussians with isotropic covariances and , respectively, enabling our analysis based on the easily computed eigenvalues [8]. We also focus on sampling from DPPs for which the size of the set is always . Details are in the supplement.
(a)  (b)  (c)  (d) 
Fig. 1 displays estimates of the total variational distance for the RFF and Nyström approximations when , varying (the repulsion strength) and the dimension . Note that the RFF method performs slightly worse as increases and is rather invariant to while the Nyström method performs much better for increasing but worse for increasing .
While this phenomenon seems perplexing at first, a study of the eigenvalues of the Gaussian kernel across dimensions sheds light on the rationale (see Fig. 1). Note that for fixed and , the decay of eigenvalues is slower in higher dimensions. It has been previously demonstrated that the Nyström method performs favorably in kernel learning tasks compared to RFF in cases where there is a large eigengap in the kernel matrix [28]. The plot of the eigenvalues seems to indicate the same phenomenon here. Furthermore, this result is consistent with the comparison of RFF to Nyström in approximating DPPs in the discrete case provided in [3].
This behavior can also be explained by looking at the theory behind these two approximations. For the RFF, while the kernel approximation is guaranteed to be an unbiased estimate of the true kernel elementwise, the variance is fairly high
[19]. In our case, we note that the RFF estimates of minors are biased because of nonlinearity in matrix entries, overestimating probabilities for point configurations that are more spread out, which leads to samples that are overlydispersed. For the Nyström method, on the other hand, the quality of the approximation depends on how well the landmarks cover . In our experiments the landmarks are sampled i.i.d. from . When either the similarity bandwidth is small or the dimension is high, the effective distance between points increases, thereby decreasing the accuracy of the approximation. Theoretical bounds for the Nyström DPP approximation in the case when is finite are provided in [3]. We believe the same result holds for continuous by extending the eigenvalues and spectral norm of the kernel matrix to operator eigenvalues and operator norms, respectively.In summary, for moderate values of it is generally good to use the Nyström approximation for lowdimensional settings and RFF for highdimensional settings.
6 Repulsive priors for mixture models
Mixture models are used in a wide range of applications from clustering to density estimation. A common issue with such models, especially in density estimation tasks, is the introduction of redundant, overlapping components that increase the complexity and reduce interpretability of the resulting model. This phenomenon is especially prominent when the number of samples is small. In a Bayesian setting, a common fix to this problem is to consider a sparse Dirichlet prior on the mixture weights, which penalizes the addition of nonzeroweight components. However, such approaches run the risk of inaccuracies in the parameter estimates [18]. Instead, [18] show that sampling the location parameters using repulsive priors leads to better separated clusters while maintaining the accuracy of the density estimate. They propose a class of repulsive priors that rely on explicitly defining a distance metric and the manner in which small distances are penalized. The resulting posterior computations can be fairly complex.
The theoretical properties of DPPs make them an appealing choice as a repulsive prior. In fact, [29] considered using DPPs as repulsive priors in latent variable models. However, in the absence of a feasible continuous DPP sampling algorithm, their method was restricted to performing MAP inference. Here we propose a fully generative probabilistic mixture model using a DPP prior for the location parameters, with a component model using a DPP.
In the common case of mixtures of Gaussians (MoG), our posterior computations can be performed using Gibbs sampling with nearly the same simplicity of the standard case where the location parameters are assumed to be i.i.d.. In particular, with the exception of updating the location parameters , our sampling steps are identical to standard MoG Gibbs updates in the uncollapsed setting. For the location parameters, instead of sampling each independently from its conditional posterior, our full conditional depends upon the other locations as well. Details are in the supplement, where we show that this full conditional has an interpretation as a single draw from a tilted DPP. As such, we can employ the Gibbs sampling scheme of Sec. 4.
We assess the clustering and density estimation performance of the DPPbased model on both synthetic and real datasets. In each case, we run 10,000 Gibbs iterations, discard 5,000 as burnin and thin the chain by 10. Hyperparameter settings are in the supplement. We randomly permute the labels in each iteration to ensure balanced label switching. Draws are postprocessed following the algorithm of
[23] to address the label switching issue.Synthetic data
To assess the role of the prior in a density estimation task, we generated a small sample of 100 observations from a mixture of two Gaussians. We consider two cases, the first with wellseparated components and the second with poorlyseparated components. We compare a mixture model with locations sampled i.i.d. (IID) to our DPP repulsive prior (DPP). In both cases, we set an upper bound of six mixture components. In Fig. 2, we see that both IID and DPP provide very similar density estimates. However, IID uses many largemass components to describe the density. As a measure of simplicity of the resulting density description, we compute the average entropy of the posterior mixture membership distribution, which is a reasonable metric given the similarity of the overall densities. Lower entropy indicates a more concise representation in an informationtheoretic sense. We also assess the accuracy of the density estimate by computing both (i) Hamming distance error relative to true cluster labels and (ii) heldout loglikelihood on 100 observations. The results are summarized in Table 1. We see that DPP results in (i) significantly lower entropy, (ii) lower overall clustering error, and (iii) statistically indistinguishable heldout loglikelihood. These results signify that we have a sparser representation with wellseparated (interpretable) clusters while maintaining the accuracy of the density estimate.
WellSep  PoorSep  Galaxy  Enzyme  Acidity 
DATASET  ENTROPY  CLUSTERING ERROR  HELDOUT LOGLIKE.  

IID  DPP  IID  DPP  IID  DPP  
Wellseparated  1.11 (0.3)  0.88 (0.2)  0.19 (0.1)  0.19 (0.1)  169 (6)  171(8) 
Poorlyseparated  1.46 (0.2)  0.92 (0.3)  0.47 (0.1)  0.39 (0.1)  211(10)  207(9) 
Real data
We also tested our DPP model on three real density estimation tasks considered in [20]: 82 measurements of velocity of galaxies diverging from our own (galaxy), acidity measurement of 155 lakes in Wisconsin (acidity), and the distribution of enzymatic activity in the blood of 245 individuals (enzyme). We once again judge the complexity of the density estimates using the posterior mixture membership entropy as a proxy. To assess the accuracy of the density estimates, we performed 5fold cross validation to estimate the predictive heldout loglikelihood. As with the synthetic data, we find that DPP visually results in better separated clusters (Fig. 2). The DPP entropy measure is also significantly lower for data that are not well separated (acidity and galaxy) while the differences in predictive loglikelihood estimates are not statistically significant (Table 2).
Finally, we consider a classification task based on the iris dataset: 150 observations from three iris species with four length measurements. For this dataset, there has been significant debate on the optimal number of clusters. While there are three species in the data, it is known that two have very low separation. Based on loss minimization, [24, 26] concluded that the optimal number of clusters was two. Table 2 compares the classification error using DPP and IID when we assume for evaluation the real data has three or two classes (by collapsing two lowseparation classes) , but consider a model with a maximum of six components. While both methods perform similarly for three classes, DPP has significantly lower classification error under the assumption of two classes, since DPP places large posterior mass on only two mixture components. This result hints at the possibility of using the DPP mixture model as a model selection method.


7 Generating diverse sample perturbations
We consider another possible application of continuousspace sampling. In many applications of inverse reinforcement learning or inverse optimal control, the learner is presented with control trajectories executed by an expert and tries to estimate a reward function that would approximately reproduce such policies
[1]. In order to estimate the reward function, the learner needs to compare the rewards of a large set of trajectories (or all, if possible), which becomes intractable in highdimensional spaces with complex nonlinear dynamics. A typical approximation is to use a set of perturbed expert trajectories as a comparison set, where a good set of trajectories should cover as large a part of the space as possible.We propose using DPPs to sample a largecoverage set of trajectories, in particular focusing on a human motion application where we assume a set of motion capture (MoCap) training data taken from the CMU database [6]. Here, our dimension is 62, corresponding to a set of joint angle measurements. For a given activity, such as dancing, we aim to select a reference pose and synthesize a set of diverse, perturbed poses. To achieve this, we build a kernel with Gaussian quality and similarity using covariances estimated from the training data associated with the activity. The Gaussian quality is centered about the selected reference pose and we synthesize new poses by sampling from our continuous DPP using the lowrank approximation scheme. In Fig. 3, we show an example of such DPPsynthesized poses. For the activity dance, to quantitatively assess our performance in covering the activity space, we compute a coverage rate metric based on a random sample of 50 poses from a DPP. For each training MoCap frame, we compute whether the frame has a neighbor in the DPP sample within an neighborhood. We compare our coverage to that of i.i.d. sampling from a multivariate Gaussian chosen to have variance matching our DPP sample. Despite favoring the i.i.d. case by inflating the variance to match the diverse DPP sample, the DPP poses still provide better average coverage over 100 runs. See Fig. 3 (right) for an assessment of the coverage metric. A visualization of the samples is in the supplement. Note that the i.i.d. case requires on average to cover all data whereas the DPP only requires . By , we cover over 90% of the data on average. Capturing the rare poses is extremely challenging with i.i.d. sampling, but the diversity encouraged by the DPP overcomes this issue.
8 Conclusion
Motivated by the recent successes of DPPbased subset modeling in finiteset applications and the growing interest in repulsive processes on continuous spaces, we considered methods by which continuousDPP sampling can be straightforwardly and efficiently approximated for a wide range of kernels. Our lowrank approach harnessed approximations provided by Nyström and random Fourier feature methods and then utilized a continuous dual DPP representation. The resulting approximate sampler garners the same efficiencies that led to the success of the DPP in the discrete case. One can use this method as a proposal distribution and correct for the approximations via MetropolisHastings, for example. For DPPs, we devised an exact Gibbs sampler that utilized the Schur complement representation. Finally, we demonstrated that continuousDPP sampling is useful both for repulsive mixture modeling (which utilizes the Gibbs sampling scheme) and in synthesizing diverse human poses (which we demonstrated with the lowrank approximation method). As we saw in the MoCap example, we can handle highdimensional spaces , with our computations scaling just linearly with . We believe this work opens up opportunities to use DPPs as parts of many models.
Acknowledgements: RHA and EBF were supported in part by AFOSR Grant FA95501210453 and DARPA Grant FA95501210406 negotiated by AFOSR. BT was partially supported by NSF CAREER Grant 1054215 and by STARnet, a Semiconductor Research Corporation program sponsored by MARCO and DARPA.
References
 Abbeel and Ng [2004] P. Abbeel and A.Y. Ng. Apprenticeship learning via inverse reinforcement learning. In Proc. ICML, 2004.
 Affandi et al. [2012] R. H. Affandi, A. Kulesza, and E. B. Fox. Markov determinantal point processes. In Proc. UAI, 2012.
 Affandi et al. [2013] R.H. Affandi, A. Kulesza, E.B. Fox, and B. Taskar. Nyström approximation for largescale determinantal processes. In Proc. AISTATS, 2013.
 Bernstein and Gobbel [1979] R. A. Bernstein and M. Gobbel. Partitioning of space in communities of ants. Journal of Animal Ecology, 48(3):931–942, 1979.
 Borodin and Rains [2005] A. Borodin and E.M. Rains. EynardMehta theorem, Schur process, and their Pfaffian analogs. Journal of statistical physics, 121(3):291–317, 2005.
 CMU [2009] CMU. Carnegie Mellon University graphics lab motion capture database. http://mocap.cs.cmu.edu/, 2009.
 Daley and VereJones [2003] D.J. Daley and D. VereJones. An introduction to the theory of point processes: Volume I: Elementary theory and methods. Springer, 2003.

Fasshauer and McCourt [2012]
G.E. Fasshauer and M.J. McCourt.
Stable evaluation of Gaussian radial basis function interpolants.
SIAM Journal on Scientific Computing, 34(2):737–762, 2012.  Gillenwater et al. [2012] J. Gillenwater, A. Kulesza, and B. Taskar. Discovering diverse and salient threads in document collections. In Proc. EMNLP, 2012.
 Hough et al. [2006] J.B. Hough, M. Krishnapur, Y. Peres, and B. Virág. Determinantal processes and independence. Probability Surveys, 3:206–229, 2006.
 Kulesza and Taskar [2010] A. Kulesza and B. Taskar. Structured determinantal point processes. In Proc. NIPS, 2010.
 Kulesza and Taskar [2011] A. Kulesza and B. Taskar. kDPPs: Fixedsize determinantal point processes. In ICML, 2011.
 Kulesza and Taskar [2012] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3), 2012.
 Lavancier et al. [2012] F. Lavancier, J. Møller, and E. Rubak. Statistical aspects of determinantal point processes. arXiv preprint arXiv:1205.4818, 2012.
 Macchi [1975] O. Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, pages 83–122, 1975.
 Matérn [1986] B. Matérn. Spatial variation. SpringerVerlag, 1986.
 Neeff et al. [2005] T. Neeff, G. S. Biging, L. V. Dutra, C. C. Freitas, and J. R. Dos Santos. Markov point processes for modeling of spatial forest patterns in Amazonia derived from interferometric height. Remote Sensing of Environment, 97(4):484–494, 2005.
 Petralia et al. [2012] F. Petralia, V. Rao, and D. Dunson. Repulsive mixtures. In NIPS, 2012.
 Rahimi and Recht [2007] A. Rahimi and B. Recht. Random features for largescale kernel machines. NIPS, 2007.
 Richardson and Green [1997] S. Richardson and P. J. Green. On Bayesian analysis of mixtures with an unknown number of components (with discussion). JRSS:B, 59(4):731–792, 1997.
 Robert and Casella [2004] C.P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2nd edition, 2004.
 Schur [1917] J Schur. Über potenzreihen, die im innern des einheitskreises beschränkt sind. Journal für die reine und angewandte Mathematik, 147:205–232, 1917.
 Stephens [2000] M. Stephens. Dealing with label switching in mixture models. JRSS:B, 62(4):795–809, 2000.
 Sugar and James [2003] C.A. Sugar and G.M. James. Finding the number of clusters in a dataset: An informationtheoretic approach. JASA, 98(463):750–763, 2003.
 Waller et al. [2011] L. A. Waller, A. Särkkä, V. Olsbo, M. Myllymäki, I.G. Panoutsopoulou, W.R. Kennedy, and G. WendelschaferCrabb. Secondorder spatial analysis of epidermal nerve fibers. Statistics in Medicine, 30(23):2827–2841, 2011.
 Wang [2010] J. Wang. Consistent selection of the number of clusters via crossvalidation. Biometrika, 97(4):893–904, 2010.
 Williams and Seeger [2000] C.K.I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. NIPS, 2000.
 Yang et al. [2012] T. Yang, Y.F. Li, M. Mahdavi, R. Jin, and Z.H. Zhou. Nyström method vs random fourier features: A theoretical and empirical comparison. NIPS, 2012.
 Zou and Adams [2012] J. Zou and R.P. Adams. Priors for diversity in generative latent variable models. In NIPS, 2012.
Appendix A Dpp, DPP, and dual DPP sampling
For discrete and finite with cardinality , we provide the algorithms for sampling from DPPs, DPPs, and DPPs via the dual representation in Algorithms 2, 3, 4. In the DPP sampler, denotes the th elementary symmetric polynomial. For continuous, we provide the continuous DPP dual sampler in Algorithm 5. Note that the only difference relative to the DPP dual sampler is in the for loop of Phase 1. The revision exactly parallels the story for the discrete case.
Appendix B Derivation of the Gibbs sampling scheme
For a DPP, the probability of choosing a specific point configuration is given by
(11) 
Denoting and , the Schur’s determinantal identity formula yields
(12) 
Conditioning on the inclusion of the other points, and suppressing constants not dependent on we can now write the conditional distribution as
(13) 
Normalizing and integrating this density yields a full conditional CDF given by
(14) 
Appendix C Overview of analytically tractable kernel types under RFF or Nyström
Sampling from a DPP with kernel using Algorithm 1 of the main paper requires that (i) we can compute a lowrank decomposition of and (ii) the terms and are computable. In the main paper, we consider a decomposition of where is a quality function and a similarity kernel. We then use either random Fourier features (RFF) or the Nyström method to approximate with . In general, we can consider RFF approximations whenever the spectral density of and characteristic function of are known. For Nyström, the statement is not quite as clear. Instead, we provide a list of standard choices and their associated feasibilities for DPP sampling in Table 3. The list is by no means exhaustive, but is simply to provide some insight. We also elaborate upon some standard kernels in the following sections.
Method  

Gaussian, Laplacian  Gaussian, Laplacian  Nyström  ✓ 
RFF  ✓  
Gibbs  ✓  
Gaussian, Laplacian  Cauchy  Nyström  ? 
RFF  ✓  
Gibbs  ?  
Cauchy  Gaussian, Laplacian  Nyström  ? 
RFF  ✓  
Gibbs  ?  
Cauchy  Cauchy  Nyström  ? 
RFF  ✓  
Gibbs  ?  
Gaussian, Laplacian  Linear, Polynomial  Nyström  ✓ 
RFF  X  
Gibbs  ✓ 
Example: Sampling from RFFapproximated DPP with Gaussian quality
Assuming and is given by a translationinvariant kernel with known characteristic function. We start by sampling . Note, for example, that the Fourier transform of a Gaussian kernel is a Gaussian while that of the Laplacian is Cauchy and vice versa. The approximated kernel is given by
(15) 
The elements of the dual matrix are then given by
(16) 
Letting be the spectral decompostition of with , and , one can straightforwardly derive:
(17) 
Likewise,
(18) 
where
Once samples are obtained, we transform back into our original coordinate system by letting .
Example: Sampling from Nyströmapproximated DPP with Gaussian quality and similarity
Assuming and , the approximated kernel is given by
(19) 
Let with , with and with . Furthermore, let , and . Then, the elements of the dual matrix are then given by
(20) 
where
Finally, the CDF of is given by
(21) 
Once samples are obtained, we transform back to our original coordinate system by letting .
Example: Sampling from Nyströmapproximated DPP with Gaussian quality and polynomial similarity
For simplicity of exposition, we consider a linear similarity kernel and , although the result can straightforwardly be extended to higher order polynomials and dimensions . Assuming and , the approximated kernel is given by
(22) 
The elements of the dual matrix are then given by
(23) 
The CDF is given by
(24) 
Example: Gibbs sampling with Gaussian quality and similarity
For generic kernels , we recall that the CDF of given for a DPP is given by
(25) 
Assuming and