Probability distributions over combinatorial families of subsets are important to a variety of problems in machine learning and related areas. Notable examples include discrete probabilistic models [11, 37, 40, 22, 26]14]; model compression ; and low-rank matrix approximations . Consequently, significant recent attention has been paid to sampling rapidly from certain structured discrete distributions [21, 36], as well as from determinantal point processes [3, 27, 28] and progress on sampling by optimization [16, 32].
Amongst these distributions, a widely used class is that of log-submodular measures. Formally, for sets , a log-submodular measure satisfies the inequality
Log-submodular measures are useful to several applications in machine learning and computer vision [26, 5]; more generally, submodular functions are widely important across machine learning [6, 25, 4].
In this note, we focus on a specific subclass of log-submodular measures, namely, strongly Rayleigh (SR) measures. These measures are intimately related to stable polynomials, a viewpoint first established in , which has proved key to uncovering their remarkable properties, both for modeling as well as for fast sampling. For instance, these measures exhibit negative association, a strong, “robust” notion of negative dependence (we formally define SR measures in Section 2).
We mention below some important examples of SR measures.
Determinantal Point Processes. A Determinantal Point Process (DPP) is a measure over subsets given by the principal minors of a positive semidefinite matrix
with eigenvalues in. Its marginal probabilities satisfy
where is the submatrix indexed by the elements in , and
is the random set distributed as a DPP. DPPs arise in random matrix theory, combinatorics, machine learning, matrix approximations, and many other areas; see e.g.,[31, 29, 30, 13, 8, 38, 26, 23, 10, 9, 28].
(Weighted) regular and balanced matroids.
The uniform distribution over the bases of certain matroids (regular matroids and balanced matroids[17, 35]) is SR, most notably, the uniform distribution over spanning trees in a graph. Here, spanning trees are viewed as subsets of edges, and the distribution is over subsets of edges.
Product measures / Bernoullis conditioned on their sum. Assume there is a weight for each element . The product measure is SR, as is its conditioning on sets of a specific cardinality , i.e., or if , and otherwise.
Strongly Rayleigh measures have been underlying recent progress in approximation algorithms [20, 1, 15, 27], graph sparsification [18, 39], extensions to the Kadison-Singer problem , finite extensions to free probability , and concentration of measure results .
Despite their importance, efficient sampling methods are only known for special cases of SR measures. In this note, we derive a provably fast mixing Markov Chain for efficiently sampling general SR measures. For our analysis, we use the recent result of  (that analyzes fast mixing for the subclass of -homogeneous SR measures), along with the closedness properties of SR measures established in the landmark work .
2 Sampling from Strongly Rayleigh Distributions
Strongly Rayleigh (SR) distributions capture the strongest form of negative dependence, while enjoying a host of other notable properties . Several important distributions exhibit the strong Rayleigh property, for example, uniform distributions over spanning trees in graphs, and more generally, the widely occurring Determinantal Point Processes. A distribution is strongly Rayleigh if its generating polynomial ,
is real stable. This means that if for all arguments of , then .
Markov Chain Sampling.
We sample from via a Markov Chain Monte Carlo method (MCMC), i.e., we run a Markov Chain with state space (the power set of ). All the chains discussed here are ergodic. The mixing time of the chain indicates the number of iterations that we must perform (after starting from an arbitrary set ) before we can consider a valid sample from . Formally, if is the total variation distance between the distribution of and after steps, then is the mixing time to sample from a distribution -close to in terms of total variation distance. We say that the chain mixes fast if is polynomial in .
Efficient sampling techniques have been studied for special cases of SR distributions. A popular method for sampling from Determinantal Point Processes uses the spectrum of the defining kernel . Generic MCMC samplers can also be derived, for example, previous work used a simple add-delete Metropolis-Hasting chain . Starting with an arbitrary set , we sample a point uniformly at random. If , we remove with probability ; if , we add it to with probability . Algorithm 1 shows the (lazy) Markov chain.
The add-delete chain can work well in practice , however, it does not always mix fast. An elementary Determinantal Point Process has non-zero measure only on sets of a fixed cardinality; for such a process (or a process close to it), the chain will stall or mix slowly.
Another special case of SR distributions are homogeneous SR measures. These measures are nonzero only for some sets of a fixed cardinality
. Examples include Bernoulli distributions conditioned on cardinality, uniform distributions on the bases of balanced matroids, and -Determinantal Point Processes. A natural MCMC sampler for these processes takes swapping steps: given a current set , it picks, uniformly at random, points and , and swaps them with probability . Algorithm 2 formalizes this procedure. Building upon results in , Anari et al.  recently showed that the mixing time for the swap sampler for homogeneous SR measures is polynomial in , , and . These results are restricted to homogeneous SR measures, and do not hold for arbitrary SR measures.
2.1 A fast mixing chain for general SR measures
In this note, we define a projection chain that works for arbitrary SR measures, and whose mixing time is polynomial in , , and . In particular, we make the results in [17, 3] accessible to general SR measures by using specific closure properties .
The resulting Markov Chain is shown in Algorithm 3. Interestingly, this sampler uses a mixture of add-delete and swap steps. Hence, intuitively, it preserves the good properties of either type of step. In general the sampled sets can have arbitrary cardinality, and hence add-delete steps are needed. If the distribution concentrates on a certain cardinality, the swap steps gain importance.
This intuition is supported by the following theorem.
If is a SR measure, the mixing time of the Markov chain in Algorithm 3 is given by
We may choose the initial set such that makes the first term in the sum logarithmic in ( in Algorithm 3).
Theorem 1 and Algorithm 3 make use of the closure of SR measures under symmetric homogenization . The idea underlying this construction is to introduce a “shadow” of the ground set , and to construct an -homogeneous SR measure on this joint ground set . Importantly, the marginal distribution on under this joint measure is exactly . The homogeneous measure leads to a fast mixing chain that is, however, not practical to implement. Hence, we reduce it to an equivalent, more efficient chain.
First, we construct a symmetric homogenization of , a measure on :
If is SR, so is its symmetric homogenization . We use this property to derive a fast-mixing chain.
The results in  show that a Markov Chain with swap steps mixes rapidly for . Precisely, they show that for any -homogeneous SR distribution on a ground set of size , a Gibbs-exchange sampler has mixing time
Here, and , leading to a mixing time of , or, equivalently,
It remains to show that the chain in Algorithm 3 is equivalent to the Gibbs-exchange sampler for . In fact, one may be tempted to implement the exchange sampler directly. However, it doubles the size of the ground set to , and always maintains a set of size . If is large, this can be impractical.
Our exchange sampler maintains a set of cardinality . In each iteration, with probability , the sampler does nothing, otherwise it proceeds. If it proceeds, it picks and uniformly at random, and exchanges them with probability
If the exchange is accepted, then the new set is .
To consider the projection of this chain onto , let , and . There are in total four possibilities for locations of and :
With probability , and , and we switch assignment of and with probability . This is equivalent to switching elements between and , i.e., an exchange step on .
With probability , we have and . In this case, independent of whether we exchange and or not, the set remains the same. Hence, in this step, remains unchanged.
With probability , we have and , and we switch with probability . This is equivalent to deleting element from .
With probability , we have and , and switch with probability . This is equivalent to adding element to .
Algorithm 3 performs those steps with exactly the same probabilities; hence, it is a projection of the exchange chain for and has the same mixing time. ∎
Remarks. By using the SR property, we obtain a clean bound for fast mixing. In certain cases, the above chain may mix slower in practice than a pure add-delete chain, since it is “lazier”, i.e., its probability of stalling is higher. However, it is guaranteed to mix well, and, in other cases, can mix much faster than the pure add-delete chain in [24, 21]. We observe both phenomena in our experiments.
Next, we empirically study the how fast our samplers converge. We compare the strongly-Rayleigh chain in Algorithm 3 (Mix) against a simple add-delete chain (Add-Delete). To monitor the convergence of these Markov chains, we use potential scale reduction factor (PSRF) [19, 12]
that runs several chains in parallel and compares within-chain variances to between-chain variances. Typically, PSRF is greater than 1 and will converge to 1 in the limit; if it is close to 1 we empirically conclude that chains have mixed. Throughout experiments we run 10 chains in parallel for estimation, and declare “convergence” at a PSRF of 1.05.
We use a Dpp on Ailerons data111http://www.dcc.fc.up.pt/657l̃torgo/Regression/DataSets.html of size 200, and the corresponding PSRF is shown in Fig. 0(b). We observe that Mix converges slightly slower than Add-Delete since it is lazier. However, the Add-Delete chain does not always mix fast. Fig. 0(c) illustrates a different setting, where we modify the eigenspectrum of the kernel matrix: the first 100 eigenvalues are 500 and others 1/500. Such a kernel corresponds to almost an elementary Dpp, where the size of the observed subsets sharply concentrates around 100. Here, Add-Delete moves very slowly. Mix, in contrast, has the ability to exchange elements and thus converges much faster than Add-Delete.
- Anari and Gharan  Nima Anari and Shayan O. Gharan. Effective-resistance-reducing flows and asymmetric tsp. In IEEE Symposium on Foundations of Computer Science (FOCS), 2015.
- Anari and Gharan  Nima Anari and Shayan Oveis Gharan. The kadison-singer problem for strongly rayleigh measures and applications to asymmetric tsp. arXiv preprint arXiv:1412.1143, 2014.
- Anari et al.  Nima Anari, Shayan Oveis Gharan, and Alireza Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. COLT, 2016.
- Bach  Francis Bach. Learning with Submodular Functions: A Convex Optimization Perspective. Foundations and Trends in Machine Learning, 2013.
- Barinova et al.  Olga Barinova, Victor Lempitsky, and Pushmeet Kohli. On detection of multiple object instances using Hough transforms. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9:1773–1784, 2012.
- Bilmes  Jeff Bilmes. Deep mathematical properties of submodularity with applications to machine learning. Tutorial at the Conference on Neural Information Processing Systems (NIPS), 2013.
- Borcea et al.  Julius Borcea, Petter Brändén, and Thomas Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, 22(2):521–567, 2009.
- Borodin  Alexei Borodin. Determinantal point processes. arXiv:0911.1153, 2009.
- Borodin and Gorin  Alexei Borodin and Vadim Gorin. Lectures on integrable probability, 2012.
- Borodin and Olshanski  Alexei Borodin and Grigori Olshanski. Distributions on partitions, point processes, and the hypergeometric kernel. Communications in Mathematical Physics, 211(2):335–358, 2000.
- Bouchard-Côté and Jordan  Alexandre Bouchard-Côté and Michael I. Jordan. Variational inference over combinatorial spaces. In NIPS, 2010.
- Brooks and Gelman  Stephen P Brooks and Andrew Gelman. General methods for monitoring convergence of iterative simulations. Journal of computational and graphical statistics, pages 434–455, 1998.
- Bufetov  Alexander I Bufetov. Infinite determinantal measures. arXiv:1207.6793, 2012.
- Cesa-Bianchi and Lugosi  Nicolo Cesa-Bianchi and Gabor Lugosi. Combinatorial bandits. In COLT, 2009.
- Deshpande et al.  Amit Deshpande, Luis Rademacher, Santosh Vempala, and Grant Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2:225–247, 2006.
- Ermon et al.  Stefano Ermon, Carla P Gomes, Ashish Sabharwal, and Bart Selman. Embed and project: Discrete sampling with universal hashing. In NIPS, pages 2085–2093, 2013.
- Feder and Mihail  Tomás Feder and Milena Mihail. Balanced matroids. In STOC, pages 26–38, 1992.
- Frieze et al.  Alan Frieze, Navin Goyal, Luis Rademacher, and Santosh Vempala. Expanders via random spanning trees. SIAM Journal on Computing, 43(2):497–513, 2014.
- Gelman and Rubin  Andrew Gelman and Donald B Rubin. Inference from iterative simulation using multiple sequences. Statistical science, pages 457–472, 1992.
- Gharan et al.  Shayan O. Gharan, Amin Saberi, and Mohit Singh. A randomized rounding approach to the Traveling Salesman Problem. In IEEE Symposium on Foundations of Computer Science (FOCS), 2011.
- Gotovos et al.  Alkis Gotovos, Hamed Hassani, and Andreas Krause. Sampling from probabilistic submodular models. In NIPS, pages 1936–1944, 2015.
Greig et al. 
Dorothy M. Greig, Bruce T. Porteous, and Allan H. Seheult.
Exact maximum a posteriori estimation for binary images.Journal of the Royal Statistical Society, 51(2), 1989.
- Hough et al.  J Ben Hough, Manjunath Krishnapur, Yuval Peres, and Bálint Virág. Determinantal processes and independence. Probability Surveys, 2006.
- Kang  Byungkon Kang. Fast determinantal point process sampling with application to clustering. In NIPS, pages 2319–2327, 2013.
- Krause and Jegelka  Andreas Krause and Stefanie Jegelka. Submodularity in Machine Learning: New directions. Tutorial at the International Conference on Machine Learning (ICML), 2013.
- Kulesza and Taskar  Alex Kulesza and Ben Taskar. Determinantal point processes for machine learning. Foundations and Trends in Machine Learning. Now, 2012.
- Li et al. [2016a] Chengtao Li, Stefanie Jegelka, and Suvrit Sra. Fast DPP sampling for Nyström with application to kernel methods. ICML, 2016a.
- Li et al. [2016b] Chengtao Li, Suvrit Sra, and Stefanie Jegelka. Gaussian quadrature for matrix inverse forms with applications. ICML, 2016b.
- Lyons  Russell Lyons. Determinantal probability measures. Publications Mathématiques de l’Institut des Hautes Études Scientifiques, 98:167–212, 2003.
- Lyons  Russell Lyons. Determinantal probability: basic properties and conjectures. arXiv:1406.2707, 2014.
- Macchi  Odile Macchi. The coincidence approach to stochastic point processes. Advances in Applied Probability, pages 83–122, 1975.
- Maddison et al.  Chris J Maddison, Daniel Tarlow, and Tom Minka. A* sampling. In NIPS, 2014.
- Marcus  Adam W Marcus. Polynomial convolutions and (finite) free probability, 2016.
- Mariet and Sra  Zelda Mariet and Suvrit Sra. Diversity networks. In ICLR, 2016.
- Pemantle and Peres  Robin Pemantle and Yuval Peres. Concentration of lipschitz functionals of determinantal and other strong rayleigh measures. Combinatorics, Probability and Computing, 23(01):140–160, 2014.
- Rebeschini and Karbasi  Patrick Rebeschini and Amin Karbasi. Fast mixing for discrete point processes. COLT, 2015.
- Smith and Eisner  David A. Smith and Jason Eisner. Dependency parsing by belief propagation. In EMNLP, 2008.
- Soshnikov  Alexander Soshnikov. Determinantal random point fields. Russian Mathematical Surveys, 55(5):923–975, 2000.
- Spielman and Srivastava  Dan Spielman and Nikhil Srivastava. Graph sparsification by effective resistances. In Symposium on Theory of Computing (STOC), 2008.
- Zhang et al.  Jian Zhang, Josip Djolonga, and Andreas Krause. Higher-order inference for multi-class log-supermodular models. In ICCV, pages 1859–1867, 2015.