Flexible Modeling of Diversity with Strongly Log-Concave Distributions

06/12/2019

by Joshua Robinson, et al. (MIT)

Strongly log-concave (SLC) distributions are a rich class of discrete probability distributions over subsets of some ground set. They are strictly more general than strongly Rayleigh (SR) distributions such as the well-known determinantal point process. While SR distributions offer elegant models of diversity, they lack an easy control over how they express diversity. We propose SLC as the right extension of SR that enables easier, more intuitive control over diversity, illustrating this via examples of practical importance. We develop two fundamental tools needed to apply SLC distributions to learning and inference: sampling and mode finding. For sampling we develop an MCMC sampler and give theoretical mixing time bounds. For mode finding, we establish a weak log-submodularity property for SLC functions and derive optimization guarantees for a distorted greedy algorithm.



1 Introduction

A variety of machine learning tasks involve selecting diverse subsets of items. How we model diversity is, therefore, a key concern with possibly far-reaching consequences. Recently popular probabilistic models of diversity include determinantal point processes [31, 37], and more generally, strongly Rayleigh (SR) distributions [8, 34]. These models have been successfully deployed for subset selection in applications such as video summarization [42], fairness [13], model compression [44], anomaly detection [48], the Nyström method [39], generative models [24, 38], and accelerated coordinate descent [49]. While valuable and broadly applicable, SR distributions have one main drawback: it is difficult to control the strength and nature of the diversity they model.

We counter this drawback by leveraging strongly log-concave (SLC) distributions [3, 4, 5]. These distributions are strictly more general than SR measures, and possess key properties that enable easier, more intuitive control over diversity. They derive their name from SLC polynomials, introduced by Gurvits a decade ago [29]. More recently they have shot into prominence due to their key role in developing deep connections between discrete and continuous convexity, with subsequent applications in combinatorics [1, 10, 32]. In particular, they lie at the heart of recent breakthrough results such as a proof of Mason's conjecture [4] and a fully polynomial-time approximation scheme for counting the number of bases of arbitrary matroids [3, 5]. We remark that all these works assume homogeneous SLC polynomials.

We build on this progress to develop fundamental tools for general SLC distributions, namely, sampling and mode finding. We highlight the flexibility of SLC distributions through two settings of importance in practice: (i) raising any SLC distribution to a power $p \in (0, 1]$; and (ii) incorporating a constraint that allows sampling sets of any size up to a budget $k$. In contrast to similar modifications of SR measures (see e.g., [48]), these settings retain the crucial SLC property. Setting (i) allows us to conveniently tune the strength of diversity by varying a single parameter, while setting (ii) offers greater flexibility than fixed-cardinality distributions such as a $k$-determinantal point process [36]. This observation is simple yet important, especially since the "right" value of $k$ is hard to fix a priori.

Contributions.

We briefly summarize the main contributions of this work below.

  • We introduce the class of strongly log-concave distributions to the machine learning community, showing how it can offer a flexible discrete probabilistic model for distributions over subsets.

  • We prove various closure properties of SLC distributions (Theorems 2-5), and show how to use these properties for better controlling the distributions used for inference.

  • We derive sampling algorithms for SLC and related distributions, and analyze their corresponding mixing times both theoretically and empirically (Algorithm 1, Theorem 8).

  • We study the negative dependence of SLC distributions by deriving a weak log-submodularity property (Theorem 10). Optimization guarantees for a selection of greedy algorithms are obtained as a consequence (Theorem 11).

As noted above, our results build on the remarkable recent progress in [3, 4, 5] and [10]. The biggest difference from this prior work is our focus on general non-homogeneous SLC polynomials, corresponding to distributions over sets of varying cardinality, as opposed to the purely homogeneous, i.e., fixed-cardinality, case. This broader focus necessitates the development of some new machinery, because unlike SR polynomials, the class of SLC polynomials is not closed under homogenization. We summarize the related work below for additional context.

1.1 Related work

SR polynomials. Strongly Rayleigh distributions were introduced in [8] as a class of discrete distributions possessing several strong negative dependence properties. It did not take long for their potential in machine learning to be identified [37]. Particular attention has been paid to determinantal point processes due to the intuitive way they capture negative dependence, and the fact that they are parameterized by a single positive semi-definite kernel matrix. Convenient parameterization has allowed an abundance of fast algorithms for learning the kernel matrix [23, 26, 43, 47], and sampling [2, 40, 46]. SR distributions are a fascinating and elegant probabilistic family whose applicability in machine learning is still an emerging topic [17, 34, 41, 45].

SLC polynomials. Gurvits introduced SLC polynomials a decade ago [29] and studied their connection to discrete convex geometry. Recently this connection was significantly developed [10, 5] by establishing that matroids, and more generally M-convex sets, are characterized by the strong log-concavity of their generating polynomial. This is in contrast to SR, for which it is known that some matroids have generating polynomials that are not SR [9].

Log-Submodular Distributions. Distributions over subsets that are log-submodular (or supermodular) are amenable to mode finding and variational inference with approximation guarantees, by exploiting the optimization properties of submodular functions [20, 21, 22]. Theoretical bounds on sampling time require additional assumptions [28]. Iyer and Bilmes [33] analyze inference for submodular distributions, establishing polynomial approximation bounds.

MCMC samplers and mixing time. The seminal works [18, 19] offer two tools for obtaining mixing time bounds for Markov chains: lower bounding the spectral gap, or the log-Sobolev constant. These techniques have been successfully deployed to obtain mixing time bounds for homogeneous SR distributions [2], general SR distributions [40], and recently homogeneous SLC distributions [5].

2 Background and setup

Notation.

We write $[n] := \{1, \ldots, n\}$, and denote by $2^{[n]}$ the power set of $[n]$. For a variable $x_i$, write $\partial_i$ to denote $\partial / \partial x_i$; for a multi-index $\alpha \in \mathbb{N}^n$ we write $\partial^\alpha = \partial_1^{\alpha_1} \cdots \partial_n^{\alpha_n}$, where $\alpha_i = 0$ means we do not take any derivatives with respect to $x_i$. For $S \subseteq [n]$ we let $e_S \in \{0, 1\}^n$ denote the binary indicator vector of $S$, and define $\partial^S = \prod_{i \in S} \partial_i$. We let $x^S$ and $x^\alpha$ denote the monomials $\prod_{i \in S} x_i$ and $\prod_{i=1}^n x_i^{\alpha_i}$ respectively. For $\mathbb{K} = \mathbb{R}$ or $\mathbb{K} = \mathbb{R}_{\geq 0}$ we write $\mathbb{K}[x_1, \ldots, x_n]$ to denote the set of all polynomials in the variables $x_1, \ldots, x_n$ whose coefficients belong to $\mathbb{K}$. A polynomial is said to be $d$-homogeneous if it is a sum of monomials all of which have degree $d$. Finally, for a set $S$ and element $i$ we minimize clutter by using $S \cup i$ and $S \setminus i$ to denote $S \cup \{i\}$ and $S \setminus \{i\}$ respectively.

SLC distributions.

We consider distributions $\mu : 2^{[n]} \to \mathbb{R}_{\geq 0}$ on the subsets of a ground set $[n]$. There is a one-to-one correspondence between such distributions and their generating polynomials

$$f_\mu(x) \;=\; \sum\nolimits_{S \subseteq [n]} \mu(S)\, x^S. \qquad (1)$$

The central object of interest in this paper is the class of strongly log-concave distributions, which is defined by imposing certain log-concavity requirements on the corresponding generating polynomials.

Definition 1.

A polynomial $f \in \mathbb{R}_{\geq 0}[x_1, \ldots, x_n]$ is strongly log-concave (SLC) if every derivative of $f$ is log-concave. That is, for any $\alpha \in \mathbb{N}^n$, either $\partial^\alpha f \equiv 0$, or $\log(\partial^\alpha f)$ is concave at all $x \in \mathbb{R}^n_{> 0}$. We say a distribution $\mu$ is strongly log-concave if its generating polynomial $f_\mu$ is strongly log-concave; we also say $\mu$ is $d$-homogeneous if $f_\mu$ is $d$-homogeneous.
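Definition 1 can be spot-checked numerically on small examples. The sketch below (a heuristic check, not a proof, and not from the paper) builds the generating polynomial (1) of a toy distribution with sympy and tests log-concavity of $f$ and a few of its derivatives at random positive points, by verifying that the Hessian of $\log f$ has no positive eigenvalues there; all function names are our own.

```python
# Heuristic numerical check of Definition 1 on a toy example (not a proof).
# We test log-concavity of f and some derivatives at random positive points
# by checking that the Hessian of log(g) is negative semidefinite there.
import itertools
import numpy as np
import sympy as sp

n = 3
x = sp.symbols(f"x0:{n}", positive=True)

# Toy distribution mu over subsets of [n]; generating polynomial as in (1).
mu = {frozenset(S): 1.0 for k in range(n + 1)
      for S in itertools.combinations(range(n), k)}
f = sum(c * sp.Mul(*[x[i] for i in S]) for S, c in mu.items())

def looks_log_concave(g, trials=20, seed=0):
    """Check the Hessian of log g is NSD (up to tolerance) at random points."""
    rng = np.random.default_rng(seed)
    H = sp.hessian(sp.log(g), x)
    Hfun = sp.lambdify(x, H, "numpy")
    for _ in range(trials):
        point = rng.uniform(0.1, 2.0, size=n)
        eigs = np.linalg.eigvalsh(np.array(Hfun(*point), dtype=float))
        if eigs.max() > 1e-8:
            return False
    return True

# Definition 1 quantifies over all derivatives; we spot-check a few of them.
derivs = [f] + [sp.diff(f, xi) for xi in x] + [sp.diff(f, x[0], x[1])]
print(all(looks_log_concave(g) for g in derivs if g != 0))  # True for this mu
```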

There are many examples of SLC distributions; we note a few important ones below.

  • Determinantal point processes [37, 27, 36, 39], and more generally, strongly Rayleigh (SR) distributions [8, 17, 41, 34].

  • Exponentiated (for exponents in $[0, 1]$) homogeneous SR distributions [48, 5].

  • The uniform distribution on the independent sets of a matroid [4].

SR distributions satisfy several strong negative dependence properties (e.g., log-submodularity and negative association). The fact that SLC is a strict superset of SR suggests that SLC distributions possess some weaker negative dependence properties. These properties will play a crucial role in the two fundamental tasks that we study in this paper: sampling and mode finding.

Sampling.

Our first task is to efficiently draw samples from an SLC distribution $\mu$. To that end, we seek to develop Markov Chain Monte Carlo (MCMC) samplers whose mixing time (see Section 4 for the definition) can be well-controlled. For homogeneous $\mu$, the breakthrough work of Anari et al. [5] provides the first analysis of fast mixing for a simple Markov chain called the Base Exchange Walk; this analysis is further refined in [15]. The Base Exchange Walk is defined as follows: if currently at state $S$, remove an element $i \in S$ uniformly at random, then move to $T = (S \setminus i) \cup j$ with probability proportional to $\mu(T)$ (the choice $j = i$ corresponds to staying at $S$). This describes a transition kernel on sets of a fixed size. We build on these works to obtain the first mixing time bounds for sampling from general (i.e., not necessarily homogeneous) SLC distributions (Section 4).
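The Base Exchange Walk is easy to state in code. Below is a minimal sketch for a homogeneous distribution given as a table of weights on fixed-size sets; the dict-based representation and names are our own illustration, and a practical implementation would evaluate $\mu$ on the fly rather than tabulating it.

```python
# Minimal sketch of the Base Exchange Walk for a homogeneous distribution mu,
# given as a dict mapping frozensets of a fixed size d to positive weights.
import random
from itertools import combinations

def base_exchange_step(S, mu, ground):
    """One step: drop a uniform element of S, re-add j w.p. prop. to mu."""
    S = set(S)
    i = random.choice(sorted(S))
    S.remove(i)
    # Candidate completions; j = i corresponds to staying at the old state,
    # so the total weight is always positive along the chain.
    cand = [j for j in ground if j not in S]
    w = [mu.get(frozenset(S | {j}), 0.0) for j in cand]
    r, acc = random.uniform(0, sum(w)), 0.0
    for j, wj in zip(cand, w):
        acc += wj
        if r <= acc:
            return frozenset(S | {j})
    return frozenset(S | {cand[-1]})    # float-safety fallback

# Example: uniform distribution over bases of the uniform matroid U(2, 4).
ground = range(4)
mu = {frozenset(B): 1.0 for B in combinations(ground, 2)}
S = frozenset({0, 1})
for _ in range(100):
    S = base_exchange_step(S, mu, ground)
print(sorted(S))
```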

Mode finding.

Our second main goal is optimization, where we consider the more general task of finding a mode of an SLC distribution subject to a cardinality constraint. This task involves solving $\max_{S : |S| \leq k} \mu(S)$, and is known to be NP-hard even for SR distributions; indeed, the maximum volume subdeterminant problem [14] is a special case. We consider a more practical approach based on observing that SLC distributions satisfy a relaxed notion of log-submodularity, which enables us to adapt simple greedy algorithms. Before presenting the details about sampling and optimization, we first need to establish some key theoretical properties of general SLC distributions. This is the subject of the next section.

3 Theoretical tools for general SLC polynomials

In this technical section we develop the theory of strong log-concavity by detailing several transformations of an SLC polynomial that preserve strong log-concavity. Such closure properties can be essential for proving the SLC property, or for developing algorithmic results. Due to the correspondence between distributions on $2^{[n]}$ and their generating polynomials, each statement concerning polynomials can be translated into a statement about probability distributions. The following theorem is a crucial stepping stone to sampling from non-homogeneous SLC distributions, and to sampling with cardinality constraints.

Theorem 2.

Let $f = \sum_S c_S x^S$ be SLC, and suppose the support $\{S : c_S > 0\}$ of the sum is the collection of independent sets of a rank $r$ matroid. Then for any $\lambda > 0$ the following polynomial is SLC:

$$H_\lambda f(x, y) \;=\; \sum\nolimits_{S} c_S\, \frac{\lambda^{\,r-|S|}}{(r-|S|)!}\, x^S y^{\,r-|S|}.$$

The above operation is also referred to as scaled homogenization, since the resulting polynomial is $r$-homogeneous and carries an added scaling factor. In fact, we may extend Theorem 2 by allowing an additional exponentiating factor:

Theorem 3.

Let $f = \sum_S c_S x^S$ be SLC, with support as in Theorem 2. Then for $p \in [0, 1]$ and any $\lambda > 0$ the following polynomial is SLC:

$$\sum\nolimits_{S} c_S^{\,p}\, \frac{\lambda^{\,r-|S|}}{(r-|S|)!}\, x^S y^{\,r-|S|}.$$

Notably, Theorem 3 fails for all $p > 1$; for a proof see Appendix A.2.

Next, we show that polarization preserves strong log-concavity. Polarization essentially means replacing a variable occurring with a higher power by multiple "copies", each occurring only with power one, in a way that makes the resulting polynomial symmetric (or permutation-invariant) in those copies. This is achieved by averaging over elementary symmetric polynomials. Formally, the polarization of a polynomial $f = \sum_{S, k} c_{S,k}\, x^S y^k$ of degree at most $r$ in $y$ is defined to be

$$\Pi f(x, y_1, \ldots, y_r) \;=\; \sum\nolimits_{S, k} c_{S,k}\, x^S\, \binom{r}{k}^{-1} e_k(y_1, \ldots, y_r),$$

where $e_k$ is the $k$th elementary symmetric polynomial in $r$ variables. The polarization has the following three properties:

  1. It is symmetric in the variables $y_1, \ldots, y_r$;

  2. Setting $y_1 = \cdots = y_r = y$ recovers $f$;

  3. $\Pi f$ is multiaffine, and hence the generating polynomial of a distribution on $2^{[n+r]}$.

Closure under polarization, combined with the homogenization results (Theorems 2 and 3), allows non-homogeneous distributions to be transformed into homogeneous ones. General SLC distributions can thus be transformed into homogeneous SLC distributions, for which fast mixing results are known [5]. How to work backwards to obtain samples from the original distribution is the topic of the next section.
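Polarization is mechanical to implement for small examples. The sketch below uses sympy to replace each power $y^k$ by the averaged elementary symmetric polynomial $e_k(y_1, \ldots, y_r)/\binom{r}{k}$, following the definition above; the function names are our own.

```python
# Sketch: polarize the y-variable of a polynomial f(x, y) of y-degree <= r,
# replacing y**k by e_k(y_1, ..., y_r) / C(r, k).
from itertools import combinations
from math import comb
import sympy as sp

def elem_sym(vars_, k):
    """k-th elementary symmetric polynomial in vars_ (equals 1 for k = 0)."""
    return sum(sp.Mul(*c) for c in combinations(vars_, k))

def polarize(f, y, r):
    ys = sp.symbols(f"{y.name}_1:{r + 1}")
    f = sp.expand(f)
    out = sum(f.coeff(y, k) * elem_sym(ys, k) / comb(r, k)
              for k in range(r + 1))
    return sp.expand(out), ys

x1, x2, y = sp.symbols("x1 x2 y")
f = 1 + x1 * y + x2 * y + y**2
Pf, ys = polarize(f, y, 2)
print(Pf)                                           # multiaffine in y_1, y_2
print(sp.expand(Pf.subs({v: y for v in ys}) - f))   # 0: diagonal recovers f
```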

Theorem 4. (This result was independently discovered by Brändén and Huh [10].)

Let $f$ be SLC, and suppose the support of the sum is the collection of independent sets of a rank $r$ matroid. Then the polarization of $f$ is SLC.

Putting all of the preceding results together we obtain the following important corollary. It is this observation that will allow us to do mode finding for SLC distributions and exponentiated, cardinality constrained SLC distributions.

Corollary 5.

Let $f = \sum_S c_S x^S$ be SLC, and suppose the support of the sum is the collection of independent sets of a rank $r$ matroid. Then the polarized, scaled homogenization $\Pi H_\lambda f^{(p)}$, where $f^{(p)} := \sum_S c_S^{\,p} x^S$, is SLC for any $p \in [0, 1]$ and $\lambda > 0$.

In Appendix A.4 we also show that SLC distributions are closed under conditioning on a fixed set size. We mention those results since they may be of independent interest, but omit them from the main text since we do not use them further in this paper.

4 Sampling from strongly log-concave distributions

In this section we outline how to use the SLC closure results from Section 3 to build a sampling algorithm for general SLC distributions and prove mixing time bounds. Recall that we are considering a probability distribution $\mu$ on $2^{[n]}$ that is strongly log-concave. The mixing time of a Markov chain started at $X_0$ is $\tau_{X_0}(\varepsilon) = \min\{t : \|P^t(X_0, \cdot) - \nu\|_{\mathrm{TV}} \leq \varepsilon\}$, where $P^t$ is the $t$-step transition kernel and $\nu$ the stationary distribution. For the remainder of this section we consider the distribution $\pi(S) \propto \mu(S)^p\, \mathbb{1}[|S| \leq k]$, where $p \in (0, 1]$ and $k \in [n]$; in particular, taking $p = 1$ and $k = n$ recovers $\mu$ itself. The power $p$ allows us to vary the degree of diversity induced by the distribution.

Our strategy is as follows: we first "extend" $\pi$ to a distribution $\pi^H$ over size-$k$ subsets of an augmented ground set containing $k$ extra "dummy" elements, thereby obtaining a homogeneous distribution. If we can sample $T \sim \pi^H$, then we can extract a sample of a scaled version of $\pi$ by simply restricting the sample to $S = T \cap [n]$. If $\mu$ were SR, then $\pi^H$ would also be SR, and a fast sampler follows from this observation [40]. But for general SLC distributions (and their powers), $\pi^H$ is not SLC, and deriving a sampler is more challenging.

To still enable the homogenization strategy, we instead derive a carefully scaled variant $\tilde\nu$ of a homogeneous version of $\pi$ that, as we prove, is homogeneous and SLC, and hence tractable. We use this rescaled version as a proposal distribution in a sampler for $\pi^H$.

To obtain an appropriately scaled, extended, homogeneous variant, we first translate Corollary 5 into probabilistic language.

Theorem 6.

Suppose that the support of the sum in the generating polynomial of $\mu$ is the collection of independent sets of a rank $r$ matroid. Then for any $p \in (0, 1]$ and $\lambda > 0$, the distribution $\nu$ on size-$r$ subsets of the augmented ground set whose generating polynomial is the polarized, scaled homogenization $\Pi H_\lambda f_\mu^{(p)}$ is SLC.

Proof.

Observe that the generating polynomial of $\nu$ is $\Pi H_\lambda f_\mu^{(p)}$, where $f_\mu$ denotes the generating polynomial of $\mu$. The result follows immediately from Corollary 5. ∎

The ultimate proposal that we use is not $\nu$, but a modified version $\tilde\nu$ that better aligns with $\pi^H$:

Proposition 7.

If $\nu$ is SLC, then the rescaled version $\tilde\nu$ is SLC.

Proof.

Lemma 39 in the Appendix says that strong log-concavity is preserved under linear transformations of the coordinates. This implies that $\tilde\nu$ is SLC, since its generating polynomial is obtained from the generating polynomial of $\nu$ by a linear transformation that fixes the coordinates indexed by $[n]$ and rescales the dummy coordinates. ∎

Importantly, since $\tilde\nu$ is homogeneous and SLC, the Base Exchange Walk for $\tilde\nu$ mixes rapidly. Let $Q$ denote the Markov transition kernel of the Base Exchange Walk for $\tilde\nu$. We use $Q$ as a proposal, and then compute the appropriate acceptance probability to obtain a chain that mixes to $\pi^H$, the symmetric homogenization of $\pi$: a $k$-homogeneous distribution on size-$k$ subsets of the augmented ground set. A crucial property of $\pi^H$ is that its marginalization over the "dummy" elements yields $\pi$, i.e., $\pi^H(\{T : T \cap [n] = S\}) = \pi(S)$. Therefore, after obtaining a sample $T \sim \pi^H$, one obtains a sample from $\pi$ by computing $S = T \cap [n]$.

1: Initialize $T$, a size-$k$ subset of the augmented ground set
2: while not mixed do
3:     Propose $T' \sim Q(T, \cdot)$, one step of the Base Exchange Walk for $\tilde\nu$
4:     Compute the Metropolis-Hastings ratio $\rho = \frac{\pi^H(T')\, Q(T', T)}{\pi^H(T)\, Q(T, T')}$
5:     Move to $T'$ with probability $\min\{1, \rho\}$, otherwise stay at $T$
Algorithm 1 Metropolis-Hastings sampler for $\pi^H$ with proposal $Q$

It is a simple computation to show that the acceptance probabilities in Algorithm 1 are indeed the Metropolis-Hastings acceptance probabilities for sampling from $\pi^H$ using the proposal $Q$. Therefore the chain mixes to $\pi^H$. We obtain the following mixing time bound, recalling that $\tau_{T_0}(\varepsilon)$ denotes the mixing time of the chain started at $T_0$.
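To make the pipeline concrete, here is a self-contained toy sketch of the Metropolis-Hastings correction in Python, targeting the exponentiated determinantal distribution used in Section 6. Everything here is our own simplification: the proposal is a plain symmetric exchange rather than the paper's $\tilde\nu$-based Base Exchange Walk, and the binomial factor corrects the dummy-padding multiplicity so that restricting to $T \cap [n]$ yields the intended marginal.

```python
# Toy sketch: sample S with P(S) proportional to det(K_S)^p, |S| <= k, by
# walking on size-k subsets T of [n] plus k dummy elements and returning
# S = T intersect [n]. A symmetric single-exchange proposal is used, so the
# M-H ratio reduces to weight(T') / weight(T).
from math import comb
import numpy as np

rng = np.random.default_rng(0)
n, k, p = 8, 3, 0.5
A = rng.standard_normal((n, n))
K = A @ A.T + 1e-6 * np.eye(n)              # random PSD kernel

def weight(T):
    """Unnormalized target on padded sets: mu(S)^p divided by the number of
    dummy paddings of S, so the marginal of S is exactly det(K_S)^p."""
    S = sorted(i for i in T if i < n)       # drop dummy elements n..n+k-1
    base = 1.0
    if S:
        sign, logdet = np.linalg.slogdet(K[np.ix_(S, S)])
        if sign <= 0:
            return 0.0
        base = float(np.exp(p * logdet))
    return base / comb(k, k - len(S))

T = set(range(n, n + k))                    # start at the all-dummy state
for step in range(20000):
    i = int(rng.choice(sorted(T)))          # drop a uniform element of T
    outside = sorted(set(range(n + k)) - T)
    j = outside[int(rng.integers(len(outside)))]
    Tp = (T - {i}) | {j}                    # propose the exchanged set
    if rng.random() < min(1.0, weight(Tp) / weight(T)):
        T = Tp
print(sorted(x for x in T if x < n))        # a (correlated) sample of S
```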

Theorem 8.

For $p \in (0, 1]$, the mixing time of the chain in Algorithm 1 started at $T_0$ satisfies a bound, proved in the appendix, that depends on $k$, $p$, and $\pi^H(T_0)$, but not directly on the ground set size $n$.

5 Maximization of weakly log-submodular functions

In this section we explore the negative dependence properties of SLC functions (unnormalized SLC distributions). To do this we introduce a new notion of weak submodularity. We then show that any function whose generating polynomial is SLC is weakly log-submodular; in particular, this includes all the examples discussed above. Finally, we prove that a distorted greedy optimization procedure yields optimization guarantees for weakly (log-)submodular functions for the cardinality constrained problem $\max_{|S| \leq k} F(S)$. Appendix C contains similar results for constrained greedy optimization of increasing weakly (log-)submodular functions and unconstrained double greedy optimization of non-negative weakly (log-)submodular functions.

Definition 9.

We call a function $F : 2^{[n]} \to \mathbb{R} \cup \{-\infty\}$ $\gamma$-weakly submodular if for any $S \in 2^{[n]}$ and $i, j \notin S$ with $i$ and $j$ not equal, we have

$$F(S \cup i) + F(S \cup j) \;\geq\; F(S \cup \{i, j\}) + F(S) - \gamma.$$

We say $f$ is $\gamma$-weakly log-submodular if $\log f$ is $\gamma$-weakly submodular.
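Our reconstruction of Definition 9 above uses the local two-element characterization with an additive slack; treat both the exact form of the definition and the following brute-force sketch as illustrative assumptions. The code computes the smallest slack $\gamma$ for which a given set function satisfies the displayed inequality.

```python
# Brute-force sketch: smallest additive slack gamma such that
# F(S|i) + F(S|j) >= F(S|{i,j}) + F(S) - gamma for all S and distinct
# i, j not in S (the local form of Definition 9 as stated above).
from itertools import combinations
import numpy as np

def weak_submodularity_slack(F, n):
    gamma = 0.0
    items = range(n)
    for size in range(n - 1):
        for S in combinations(items, size):
            S = frozenset(S)
            for i, j in combinations([e for e in items if e not in S], 2):
                lhs = F(S | {i}) + F(S | {j})
                rhs = F(S | {i, j}) + F(S)
                gamma = max(gamma, rhs - lhs)
    return gamma

# Example: F(S) = log det(K_S) (the log of a DPP weight) on a small kernel.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
K = A @ A.T + np.eye(4)

def F(S):
    S = sorted(S)
    return float(np.linalg.slogdet(K[np.ix_(S, S)])[1]) if S else 0.0

print(weak_submodularity_slack(F, 4))  # ~0: log det is in fact submodular
```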

Note carefully that our notion of weak submodularity differs from the notion of weak submodularity that already appears in the literature [16, 30, 35]. Building on a result by Brändén and Huh [10], we prove the following result.

Theorem 10.

Any non-negative function on $2^{[n]}$ whose generating polynomial is strongly log-concave is $\gamma$-weakly log-submodular for an explicit constant $\gamma$.

This result, whilst weaker than log-submodularity, gives a path to optimizing strongly log-concave functions. Consider $F : 2^{[n]} \to \mathbb{R} \cup \{-\infty\}$, assumed to be $\gamma$-weakly submodular. Note in particular that we do not assume $F$ is non-negative. This is important since we are interested in applying this procedure to the logarithm of a distribution, which need not be non-negative. Define $c_i := \max\{0,\, F(\emptyset) - F(\{i\})\}$ and $c(S) := \sum_{i \in S} c_i$, with the convention that $c(\emptyset) = 0$. Then we may decompose $F = g - c$, where $g := F + c$. Note that $g$ is $\gamma$-weakly submodular and $c$ is a non-negative modular function.

We extend the distorted greedy algorithm of [25, 30] to our notion of weak submodularity. To do so, we introduce the distorted objective $\Phi_t(S) := (1 - \frac{1}{k})^{k-t} g(S) - c(S)$ for $t = 0, 1, \ldots, k$. The distorted greedy algorithm greedily builds a set of size at most $k$ by forming a sequence $\emptyset = S_0, S_1, \ldots, S_k$, where $S_{t+1}$ is formed by adding to $S_t$ the element $e$ that maximizes $\Phi_{t+1}(S_t \cup e) - \Phi_{t+1}(S_t)$, so long as this increment is positive.

1: Let $S_0 = \emptyset$
2: for $t = 0, 1, \ldots, k - 1$ do
3:     Set $e_t = \arg\max_{e \in [n] \setminus S_t} \Phi_{t+1}(S_t \cup e) - \Phi_{t+1}(S_t)$
4:     if $\Phi_{t+1}(S_t \cup e_t) - \Phi_{t+1}(S_t) > 0$ then
5:         $S_{t+1} = S_t \cup e_t$
6:     else $S_{t+1} = S_t$
7: return $S_k$
Algorithm 2 Distorted greedy weakly submodular constrained maximization of $F = g - c$
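For concreteness, here is a sketch of distorted greedy in the style of Harshaw et al. [30], which Algorithm 2 adapts: it maximizes $F = g - c$ under $|S| \leq k$ using the distorted objective with the $(1 - 1/k)$ discount. Our reconstruction of the distortion schedule is an assumption; the paper's weakly submodular variant may differ in constants.

```python
# Sketch of distorted greedy for F = g - c with budget k, following the
# distorted objective Phi_t(S) = (1 - 1/k)^(k - t) g(S) - c(S) described
# above. g: set -> float (weakly submodular); c: nonneg modular weights.
def distorted_greedy(g, c, ground, k):
    S = set()
    for t in range(k):
        discount = (1.0 - 1.0 / k) ** (k - t - 1)
        best, best_gain = None, 0.0
        for e in ground - S:
            gain = discount * (g(S | {e}) - g(S)) - c[e]
            if gain > best_gain:
                best, best_gain = e, gain
        if best is not None:        # add only if the distorted gain is > 0
            S.add(best)
    return S

# Toy usage: g is coverage-style (submodular), c charges each element.
sets = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}, 3: {"d", "e", "f"}}
g = lambda S: float(len(set().union(*(sets[e] for e in S)) if S else set()))
c = {0: 0.1, 1: 0.1, 2: 1.0, 3: 0.2}
print(distorted_greedy(g, c, set(sets), 2))   # e.g. {0, 3}
```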
Theorem 11.

Suppose $F = g - c$ is $\gamma$-weakly submodular, with $g$ and $c$ as above. Then the solution $S_k$ obtained by the distorted greedy algorithm satisfies

$$F(S_k) \;\geq\; (1 - e^{-1})\, g(S^*) - c(S^*) - O(k\gamma),$$

where $S^*$ is any set of size at most $k$.

Note that any weakly submodular function can be brought into the required form by subtracting $F(\emptyset)$ if it is non-zero. If $f$ is weakly log-submodular, we can decompose $\log f = g - c$, where $g$ and $c$ play the same roles as in the weakly submodular setting. Then by applying Theorem 11 to $\log f$ we obtain the following corollary.

Corollary 12.

Suppose $f$ is $\gamma$-weakly log-submodular, with $\log f = g - c$ as above. Then the solution $S_k$ obtained by the distorted greedy algorithm satisfies

$$\log f(S_k) \;\geq\; (1 - e^{-1})\, g(S^*) - c(S^*) - O(k\gamma)$$

for any $S^*$ with $|S^*| \leq k$.

6 Experiments

In this section we empirically evaluate the mixing time of Algorithm 1. We use the standard potential scale reduction factor (PSRF) metric to measure convergence to the stationary distribution [11]. The method involves running several chains in parallel and computing the average variance within each chain and between the chains. The PSRF score is the ratio of the between-chain variance to the within-chain variance, and is usually above $1$. When the PSRF score is close to $1$, the chains are considered to be mixed. In all of our experiments we run three chains in parallel and declare them to be mixed once the PSRF score drops below a fixed threshold close to $1$.
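The PSRF computation itself is only a few lines; below is a sketch of the standard Gelman-Rubin estimator of [11], applied to a scalar summary of each chain's states (e.g., the sampled set size).

```python
# Sketch of the potential scale reduction factor (Gelman-Rubin) for m
# chains of length t, applied to a scalar summary of each MCMC state.
import numpy as np

def psrf(chains):
    """chains: (m, t) array, one row per chain. Returns the PSRF score."""
    x = np.asarray(chains, dtype=float)
    m, t = x.shape
    means = x.mean(axis=1)
    W = x.var(axis=1, ddof=1).mean()        # mean within-chain variance
    B = t * means.var(ddof=1)               # between-chain variance
    var_hat = (t - 1) / t * W + B / t       # pooled variance estimate
    return np.sqrt(var_hat / W)

# Three toy chains drawn from the same distribution -> PSRF near 1.
rng = np.random.default_rng(0)
chains = rng.normal(size=(3, 5000))
print(psrf(chains))   # ~1.0; declare mixed once below a threshold near 1
```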

Figure 1: Empirical mixing time analysis for sampling from a fixed-size ground set under various cardinality constraints $k$: (a) the PSRF score for each set of chains; (b) the approximate mixing time obtained by thresholding the PSRF.
Figure 2: (a, b) Empirical mixing time analysis for sampling a set of size at most $k$ for varying ground set sizes: (a) the PSRF score for each set of chains; (b) the approximate mixing time obtained by thresholding the PSRF; (c) comparison of Algorithm 1 with a Metropolis-Hastings algorithm whose proposal is built using $\nu$ rather than $\tilde\nu$.

Figure 1 considers the results of running the Metropolis-Hastings algorithm on a sequence of problems with different cardinality constraints $k$. In each case we considered the distribution $\pi(S) \propto \det(K_S)^p\, \mathbb{1}[|S| \leq k]$, where $K$ is a randomly generated PSD matrix and $K_S$ denotes the submatrix of $K$ whose row and column indices belong to $S$. These simulations suggest that the mixing time grows linearly in $k$ for a fixed ground set size.

Figure 2 considers the results of running the Metropolis-Hastings algorithm on a sequence of problems with different ground set sizes $n$. In each case we considered the distribution $\pi(S) \propto \det(K_S)^p\, \mathbb{1}[|S| \leq k]$, where $K$ is a randomly generated PSD matrix of the appropriate size. These simulations suggest that the mixing time grows sublinearly in $n$ for a fixed $k$.

It is important to know whether the mixing time is robust to different spectra of $K$. We consider three cases: (i) smoothly decaying eigenvalues; (ii) a single large eigenvalue; and (iii) one fifth of the eigenvalues equal to a large value, the rest equal to a small value. Note that due to normalization, multiplying the spectrum by a constant does not affect the resulting distribution. The results for (i) are the content of Figures 1 and 2 (a, b). Figures 3 and 4 show the results for (ii), and Figures 5 and 6 show the results for (iii). Figures 3-6 can be found in Appendix D.

Finally, we address the question of why the proposal distribution was built using the particular choice of $\tilde\nu$. Indeed, one may use the Base Exchange Walk for any homogeneous distribution to build a sampler; one simply needs to compute the appropriate acceptance probabilities. We restrict our attention to SLC distributions so as to build on the recent mixing time results for homogeneous SLC distributions. An obvious alternative to $\tilde\nu$ for building the proposal is $\nu$ itself. Figure 2(c) compares the empirical mixing times of these two chains. The strong empirical improvement justifies our choice of adding the extra rescaling.

7 Discussion

In this paper we introduced strongly log-concave distributions as a promising class of models for diversity. They have flexibility beyond that of strongly Rayleigh distributions, e.g., via exponentiated and cardinality constrained distributions (which do not preserve the SR property). We derived a suite of MCMC samplers for general SLC distributions and associated mixing time bounds. For optimization, we showed that SLC distributions satisfy a weak submodularity property and proved mode finding guarantees.

Still, many open problems remain. Although the mixing time bound has the interesting property of not directly depending on the ground set size $n$, it seems quite conservative compared to the empirical mixing time results. An important future direction would be to bridge this gap. More fundamentally, the negative dependence properties of SLC distributions need to be explored in greater detail. Although in this work we proved a weak submodularity property for SLC distributions, we know of no examples of SLC distributions that are not log-submodular in the usual strong sense. This leads to the following conjecture, which if true would lead to stronger optimization guarantees.

Conjecture 13.

All strongly log-concave distributions are log-submodular.

Finally, in order for SLC models to be deployed in practice the user needs a way to learn a good SLC model from data. Both exponentiation and cardinality constraint add a single parameter that must be learned. We leave the question of how best to learn these parameters as an important topic for future work.

References

  • Adiprasito et al. [2018] Karim Adiprasito, June Huh, and Eric Katz. Hodge theory for combinatorial geometries. Annals of Mathematics, 188(2):381–452, 2018.
  • Anari et al. [2016] Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In Conference on Learning Theory, pages 103–115, 2016.
  • Anari et al. [2018a] Nima Anari, Shayan Oveis Gharan, and Cynthia Vinzant. Log-concave polynomials, entropy, and a deterministic approximation algorithm for counting bases of matroids. In Annual Symposium on Foundations of Computer Science, pages 35–46. IEEE, 2018a.
  • Anari et al. [2018b] Nima Anari, Kuikui Liu, Shayan Oveis Gharan, and Cynthia Vinzant. Log-Concave Polynomials III: Mason’s Ultra-Log-Concavity Conjecture for Independent Sets of Matroids. arXiv:1811.01600, 2018b.
  • Anari et al. [2019] Nima Anari, Kuikui Liu, Shayan Oveis Gharan, and Cynthia Vinzant. Log-Concave Polynomials II: High-Dimensional Walks and an FPRAS for Counting Bases of a Matroid. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. ACM, June 2019.
  • Bapat and Raghavan [1997] Ravi B. Bapat and T. E. S. Raghavan. Nonnegative matrices and applications, volume 64. Cambridge University Press, 1997.
  • Berg et al. [1984] Christian Berg, Jens Peter Reus Christensen, and Paul Ressel. Harmonic analysis on semigroups: theory of positive definite and related functions, volume 100. Springer, 1984.
  • Borcea et al. [2009] Julius Borcea, Petter Brändén, and Thomas Liggett. Negative Dependence and the Geometry of Polynomials. Journal of the American Mathematical Society, 22(2):521–567, 2009.
  • Brändén [2007] Petter Brändén. Polynomials with the half-plane property and matroid theory. Advances in Mathematics, 216(1):302–320, 2007.
  • Brändén and Huh [2019] Petter Brändén and June Huh. Lorentzian polynomials. arXiv:1902.03719, 2019.
  • Brooks and Gelman [1998] Stephen P Brooks and Andrew Gelman. General Methods for Monitoring Convergence of Iterative Simulations. Journal of computational and graphical statistics, 7(4):434–455, 1998.
  • Buchbinder et al. [2015] Niv Buchbinder, Moran Feldman, Joseph Seffi, and Roy Schwartz. A Tight Linear Time (1/2)-Approximation for Unconstrained Submodular Maximization. SIAM Journal on Computing, 44(5):1384–1402, 2015.
  • Celis et al. [2018] L Elisa Celis, Vijay Keswani, Damian Straszak, Amit Deshpande, Tarun Kathuria, and Nisheeth K Vishnoi. Fair and diverse DPP-based data summarization. arXiv:1802.04023, 2018.
  • Civril and Magdon-Ismail [2013] Ali Civril and Malik Magdon-Ismail. Exponential inapproximability of selecting a maximum volume sub-matrix. Algorithmica, 65(1):159–176, 2013.
  • Cryan et al. [2019] Mary Cryan, Heng Guo, and Giorgos Mousa. Modified log-Sobolev inequalities for strongly log-concave distributions. arXiv:1903.06081, 2019.
  • Das and Kempe [2011] Abhimanyu Das and David Kempe. Submodular meets spectral: Greedy algorithms for subset selection, sparse approximation and dictionary selection. In International Conference on Machine Learning, 2011.
  • Derezinski and Warmuth [2017] Michal Derezinski and Manfred K Warmuth. Unbiased Estimates for Linear Regression via Volume Sampling. In Advances in Neural Information Processing Systems, pages 3084–3093, 2017.
  • Diaconis et al. [1991] Persi Diaconis, Daniel Stroock, et al. Geometric Bounds for Eigenvalues of Markov Chains. The Annals of Applied Probability, 1(1):36–61, 1991.
  • Diaconis et al. [1996] Persi Diaconis, Laurent Saloff-Coste, et al. Logarithmic Sobolev Inequalities for Finite Markov Chains. The Annals of Applied Probability, 6(3):695–750, 1996.
  • Djolonga and Krause [2014] Josip Djolonga and Andreas Krause. From map to marginals: Variational inference in bayesian submodular models. In Neural Information Processing Systems (NIPS), 2014.
  • Djolonga and Krause [2015] Josip Djolonga and Andreas Krause. Scalable variational inference in log-supermodular models. In International Conference on Machine Learning (ICML), 2015.
  • Djolonga et al. [2018] Josip Djolonga, Stefanie Jegelka, and Andreas Krause. Provable variational inference for constrained log-submodular models. In Neural Information Processing Systems (NeurIPS), 2018.
  • Dupuy and Bach [2018] Christophe Dupuy and Francis Bach. Learning Determinantal Point Processes in Sublinear Time. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2018.
  • Elfeki et al. [2018] Mohamed Elfeki, Camille Couprie, Morgane Riviere, and Mohamed Elhoseiny. GDPP: Learning Diverse Generations Using Determinantal Point Process. arXiv:1812.00068, 2018.
  • Feldman [2018] Moran Feldman. Guess free maximization of submodular and linear sums. arXiv:1810.03813, 2018.
  • Gartrell et al. [2017] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. Low-rank Factorization of Determinantal Point Processes. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
  • Gillenwater et al. [2012] Jennifer Gillenwater, Alex Kulesza, and Ben Taskar. Near-optimal map inference for determinantal point processes. In Advances in Neural Information Processing Systems, pages 2735–2743, 2012.
  • Gotovos et al. [2015] Alkis Gotovos, S. Hamed Hassani, and Andreas Krause. Sampling from probabilistic submodular models. In Neural Information Processing Systems (NIPS), 2015.
  • Gurvits [2009] Leonid Gurvits. On multivariate Newton-like inequalities. In Advances in Combinatorial Mathematics, pages 61–78. Springer, 2009.
  • Harshaw et al. [2019] Christopher Harshaw, Moran Feldman, Justin Ward, and Amin Karbasi. Submodular maximization beyond non-negativity: Guarantees, fast algorithms, and applications. arXiv:1904.09354, 2019.
  • Hough et al. [2006] J. Ben Hough, Manjunath Krishnapur, Yuval Peres, and Bálint Virág. Determinantal Processes and Independence. Probab. Surveys, 3:206–229, 2006.
  • Huh [2018] June Huh. Combinatorial applications of the Hodge-Riemann relations. Proceedings of the International Congress of Mathematicians, 2018.
  • Iyer and Bilmes [2015] Rishabh Iyer and Jeffrey Bilmes. Submodular point processes with applications to machine learning. In Conference on Artificial Intelligence and Statistics (AISTATS), 2015.
  • Jegelka and Sra [2018] Stefanie Jegelka and Suvrit Sra. Negative dependence, stable polynomials, and all that. NeurIPS 2018 Tutorial, 2018.
  • Khanna et al. [2017] Rajiv Khanna, Ethan Elenberg, Alexandros G Dimakis, Sahand Negahban, and Joydeep Ghosh. Scalable greedy feature selection via weak submodularity. arXiv:1703.02723, 2017.
  • Kulesza and Taskar [2011] Alex Kulesza and Ben Taskar. k-DPPs: Fixed-size determinantal point processes. In Proceedings of the 28th International Conference on Machine Learning, pages 1193–1200, 2011.
  • Kulesza et al. [2012] Alex Kulesza, Ben Taskar, et al. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning, 5(2–3):123–286, 2012.
  • Kwok and Adams [2012] James T Kwok and Ryan P Adams. Priors for diversity in generative latent variable models. In Advances in Neural Information Processing Systems, pages 2996–3004, 2012.
  • Li et al. [2016a] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast DPP Sampling for Nyström with Application to Kernel Methods. In International Conference on Machine Learning, pages 2061–2070, 2016a.
  • Li et al. [2016b] Chengtao Li, Suvrit Sra, and Stefanie Jegelka. Fast mixing Markov chains for strongly Rayleigh measures, DPPs, and constrained sampling. In Advances in Neural Information Processing Systems, pages 4188–4196, 2016b.
  • Li et al. [2017] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Polynomial time algorithms for dual volume sampling. In Advances in Neural Information Processing Systems, pages 5038–5047, 2017.
  • Lin and Bilmes [2012] Hui Lin and Jeff Bilmes. Learning mixtures of submodular shells with application to document summarization. In Uncertainty in Artificial Intelligence (UAI), 2012.
  • Mariet and Sra [2015] Zelda Mariet and Suvrit Sra. Fixed-point algorithms for learning determinantal point processes. In International Conference on Machine Learning, pages 2389–2397, 2015.
  • Mariet and Sra [2016a] Zelda Mariet and Suvrit Sra. Diversity networks: Neural network compression using determinantal point processes. International Conference on Learning Representations, 2016a.
  • Mariet and Sra [2017] Zelda Mariet and Suvrit Sra. Elementary symmetric polynomials for optimal experimental design. In Advances in Neural Information Processing Systems, 2017.
  • Mariet et al. [2019] Zelda Mariet, Yaniv Ovadia, and Jasper Snoek. DPPNet: Approximating Determinantal Point Processes with Deep Networks. arXiv:1901.02051, 2019.
  • Mariet and Sra [2016b] Zelda E Mariet and Suvrit Sra. Kronecker determinantal point processes. In Advances in Neural Information Processing Systems, pages 2694–2702, 2016b.
  • Mariet et al. [2018] Zelda E Mariet, Suvrit Sra, and Stefanie Jegelka. Exponentiated Strongly Rayleigh Distributions. In Advances in Neural Information Processing Systems, pages 4459–4469, 2018.
  • Rodomanov and Kropotov [2019] Anton Rodomanov and Dmitry Kropotov. A randomized coordinate descent method with volume sampling. arXiv:1904.04587, 2019.

Appendix A Proofs for operations preserving strong log-concavity

In this section we prove Theorems 2, 3, and 4.

A.1 Closure under scaled homogenization

Let us begin this section by observing that closure under homogenization and symmetric homogenization both fail for strongly log-concave polynomials. The homogenization of a polynomial $f = \sum_S c_S x^S$ of degree $d$ is $f^H(x, y) := \sum_S c_S\, x^S y^{d - |S|}$, and its symmetric homogenization is the polarization of $f^H$ in the variable $y$.

We will use the following lemma.

Lemma 14.

([5]) A bivariate quadratic polynomial with non-negative coefficients is SLC if and only if its coefficients satisfy an explicit inequality.

The counterexample is as follows: by the preceding lemma one can choose a bivariate SLC polynomial $f$ whose homogenization $f^H$ fails to be log-concave; a quick computational check shows that the Hessian of $f^H$ at a positive point has more than one positive eigenvalue. One may check the same for the symmetric homogenization of $f$. This shows that SLC is not closed under homogenization or symmetric homogenization, so we seek modified operations under which SLC is preserved. In Section 3 we introduced the scaled homogenization $H_\lambda f$ of $f$, and we now prove that it preserves the SLC property.

Theorem 15.

Let $M$ be a rank $r$ matroid and let $f = \sum_S c_S x^S$ be SLC, where the support $\{S : c_S > 0\}$ is the collection of independent sets of $M$. For any $\lambda > 0$ the polynomial $H_\lambda f$ is SLC.

A key component of proving this theorem is the following lemma.

Lemma 16.

Let $f$ be multiaffine and SLC, and suppose that its support is the collection of independent sets of a matroid of rank $r$. Then the corresponding degree-two derivative of $H_\lambda f$ is log-concave.

Proof.

Let $g$ denote the relevant degree-two derivative of $H_\lambda f$. Note that since $g$ is of degree two, its Hessian $\nabla^2 g$ is a constant matrix that does not depend on $x$ or $y$. Therefore, $g$ is log-concave on the positive orthant if and only if it is log-concave at a single point, which by Lemma 40 happens if and only if a certain matrix formed from $\nabla^2 g$ is negative semidefinite. Evaluating the relevant entries at this point and verifying negative semidefiniteness completes the proof. ∎

Proof of Theorem 15.

To prove strong log-concavity we proceed by verifying the hypotheses of Theorem 37. Let $\alpha$ be a multi-index with respect to which we differentiate. If $\alpha$ selects any element outside the support, the derivative vanishes, so we may assume otherwise. The derivative of $H_\lambda f$ then corresponds to a scaled homogenization over the matroid contraction of $M$ by the selected elements. We first check indecomposability: if $i$ is a loop of the contraction then the variable $x_i$ does not appear in the derivative, and otherwise the relevant mixed monomial appears with non-zero coefficient. In particular, the graph formed in the definition of indecomposability is a star centered at the homogenizing variable and is therefore connected, proving indecomposability. Now suppose the derivative is of degree two. It is SLC, and we may apply Lemma 16 to conclude that it is log-concave. ∎

A.2 Closure under scaled exponentiation

For matrices $A$ and $B$ and a scalar $p$, we write $A^{\circ p}$ to denote the element-wise power and $A \circ B$ to denote the Hadamard (element-wise) product. The proofs of Theorem 3 and of the corresponding theorem from [5] both boil down to the following linear algebra fact.

Lemma 17.
(This result was independently discovered by Brändén and Huh [10].)

Suppose $A$ is symmetric, has non-negative entries, and at most one positive eigenvalue. Then $A^{\circ p}$ also has at most one positive eigenvalue for $p \in [0, 1]$.

To prove Lemma 17 we recall a couple of facts from linear algebra. We shall call a matrix $A$ conditionally negative definite if $x^\top A x \leq 0$ for all $x$ such that $\sum_i x_i = 0$.

Lemma 18.

([6]) Suppose $A$ is symmetric, has positive entries, and at most one positive eigenvalue. Then the matrix with entries $A_{ij}/(v_i v_j)$ is conditionally negative definite, where $v$ is the Perron-Frobenius eigenvector of $A$.

Lemma 19.

([7]) Suppose $A$ is conditionally negative definite with non-negative entries. Then $A^{\circ p}$ is conditionally negative definite for $p \in [0, 1]$.

With these two facts in hand we are now ready to prove Lemma 17.

Proof of Lemma 17.

Assume that $A_{ij} > 0$ for all $i, j$ and that $p \in (0, 1)$; the general case is then obtained by a limiting argument. Since $A$ is symmetric, has positive entries, and at most one positive eigenvalue, Lemma 18 implies that the matrix $B$ with entries $B_{ij} = A_{ij}/(v_i v_j)$ is conditionally negative definite, where $v$ is the Perron-Frobenius eigenvector of $A$. Then Lemma 19 tells us that $B^{\circ p}$ is also conditionally negative definite, and hence has at most one positive eigenvalue. Note the identity $A^{\circ p} = D^{\circ p} B^{\circ p} D^{\circ p}$, where $D = \mathrm{diag}(v)$. Since the entries of the Perron-Frobenius eigenvector are all strictly positive, $D^{\circ p}$ is non-singular. We may therefore apply Sylvester's law of inertia to conclude that $A^{\circ p}$ and $B^{\circ p}$ have the same number of positive eigenvalues: at most one. ∎

Next, we prove Theorem 3 by showing how it reduces to exactly the statement of Lemma 17.

Proof of Theorem 3.

Consider a multi-index with respect to which we differentiate twice. One observes that the resulting Hessian is, up to positive scaling, the element-wise $p$-th power of the Hessian arising in Theorem 2. By Theorem 2 that matrix has at most one positive eigenvalue, and so we may apply Lemma 17 to yield the result. ∎

Note that Lemma 19 also permits a simplified proof of the following theorem, due to Anari et al. [5], concerning the homogeneous case.

Theorem 20.

Suppose $f = \sum_S c_S x^S$ is homogeneous and SLC. Then $f^{(p)} := \sum_S c_S^{\,p} x^S$ is SLC for any $p \in [0, 1]$.

Proof.

We prove strong log-concavity of $f^{(p)}$ by verifying the hypotheses of Theorem 37. Assume $c_S > 0$ on the support; the general case is then obtained by taking point-wise limits. Consider any mixed second derivative of $f^{(p)}$, and notice that its Hessian is, up to positive scaling, the element-wise $p$-th power of the corresponding Hessian for $f$. Since $f$ is SLC, Lemma 40 implies the latter has at most one positive eigenvalue. So Lemma 17 implies that the former also has at most one positive eigenvalue, which proves the strong log-concavity of $f^{(p)}$ by applying Lemma 40 in the other direction. ∎

It is reasonable to ask whether the preceding theorem or Theorem 3 can be extended to the regime $p > 1$. This is in fact not the case. First we show that if either statement held for some $p > 1$ then it would hold for all $p > 1$; then we give a counterexample showing that both fail for a particular $p > 1$. Note carefully that the conclusion is therefore stronger than a mere existence claim. In fact we may conclude: Theorems 3 and 20 both fail for all $p > 1$.

To make the following statement succinct, let us define $\mathcal{M}$ to be the set of all symmetric real-valued matrices with non-negative entries and at most one positive eigenvalue.

Lemma 21.

Suppose there is a $q > 1$ such that: if $A \in \mathcal{M}$ then $A^{\circ q} \in \mathcal{M}$. Then for any $p > 1$: if $A \in \mathcal{M}$ then $A^{\circ p} \in \mathcal{M}$.

Proof.

Let $A \in \mathcal{M}$ and $p > 1$. We may repeatedly apply the hypothesis to conclude that $A^{\circ q^m} \in \mathcal{M}$ for any $m \in \mathbb{N}$. So in particular we may pick $m$ sufficiently big that $q^m \geq p$. But now $p / q^m \leq 1$, so we may apply Lemma 17 to conclude that $A^{\circ p} = (A^{\circ q^m})^{\circ (p / q^m)} \in \mathcal{M}$. ∎

Corollary 22.

Suppose there is a $q > 1$ such that: if $f$ is SLC, then $f^{(q)}$ is SLC. Then for any $p > 1$: if $f$ is SLC then $f^{(p)}$ is SLC.

Corollary 23.

Suppose there is a $q > 1$ such that: if $f$ is SLC, then the polynomial of Theorem 3 with exponent $q$ is SLC for every $\lambda > 0$. Then for any $p > 1$ the same holds with exponent $p$.

For the counterexample, consider a suitable quadratic polynomial $f$ in two variables. One can numerically check that its (constant) Hessian has exactly one positive eigenvalue, so $f$ is log-concave by Lemma 40, and hence SLC since it is of degree 2. However, for an explicit $p > 1$ the Hessian of $f^{(p)}$ has two positive eigenvalues, so $f^{(p)}$ is not SLC.

The same example can be used to build a counterexample to Theorem 3 in the regime $p > 1$. Indeed, specializing the homogenizing variable in the polynomial of Theorem 3, we obtain an SLC polynomial whose exponentiated transform is not SLC.

A.3 Closure under polarization

We begin by observing an algebraic identity that allows one to push derivatives inside the polarization operation $\Pi$.

Lemma 24.

Let $f \in \mathbb{R}_{\geq 0}[x_1, \ldots, x_n, y]$. Then $\partial_i\, \Pi f = \Pi\, \partial_i f$ for $i \in [n]$, and the derivatives $\partial_{y_j} \Pi f$, for $j \in [r]$, can be expressed via the polarization of $\partial_y f$.

Proof.

Since $\Pi f$ is symmetric in $y_1, \ldots, y_r$, to prove the second part of the claim it suffices to prove it for $y_1$ only. Recall that the polarization of $f$ is as defined in Section 3, where $e_k$ is the $k$th elementary symmetric polynomial in $r$ variables. We begin by computing directly,