# Fast Mixing Markov Chains for Strongly Rayleigh Measures, DPPs, and Constrained Sampling

We study probability measures induced by set functions with constraints. Such measures arise in a variety of real-world settings, where prior knowledge, resource limitations, or other pragmatic considerations impose constraints. We consider the task of rapidly sampling from such constrained measures, and develop fast Markov chain samplers for them. Our first main result is for MCMC sampling from Strongly Rayleigh (SR) measures, for which we present sharp polynomial bounds on the mixing time. As a corollary, this result yields a fast mixing sampler for Determinantal Point Processes (DPPs), which is (to our knowledge) the first provably fast MCMC sampler for DPPs since their inception over four decades ago. Beyond SR measures, we develop MCMC samplers for probabilistic models with hard constraints and identify sufficient conditions under which their chains mix rapidly. We illustrate our claims by empirically verifying the dependence of mixing times on the key factors governing our theoretical bounds.

## 1 Introduction

Distributions over subsets of objects arise in a variety of machine learning applications. They occur as discrete probabilistic models [5, 36, 38, 20, 28] in computer vision, computational biology and natural language processing. They also occur in combinatorial bandit learning [9], as well as in recent applications to neural network compression [32] and matrix approximations [29].

Yet, practical use of discrete distributions can be hampered by computational challenges due to their combinatorial nature. Consider for instance sampling, a task fundamental to learning, optimization, and approximation. Without further restrictions, efficient sampling can be impossible [13]. Several lines of work thus focus on identifying tractable sub-classes, which in turn have had wide-ranging impacts on modeling and algorithms. Important examples include the Ising model [22], matchings (and the matrix permanent) [23], spanning trees (and graph algorithms) [6, 16, 37, 2], and Determinantal Point Processes (DPPs) that have gained substantial attention in machine learning [28, 30, 17, 24, 3, 26].

In this work, we extend the classes of tractable discrete distributions. Specifically, we consider the following two classes of distributions on $2^V$ (the set of subsets of a ground set $V = [N] := \{1, \ldots, N\}$): (1) strongly Rayleigh (SR) measures, and (2) distributions with certain cardinality or matroid constraints. We analyze Markov chains for sampling from both classes. As a byproduct of our analysis, we answer a long-standing question about rapid mixing of MCMC sampling from DPPs.

SR measures are defined by strong negative correlations, and have recently emerged as valuable tools in the design of algorithms [2], in the theory of polynomials and combinatorics [4], and in machine learning through DPPs, a special case of SR distributions. Our first main result is the first polynomial-time sampling algorithm that applies to all SR measures (and thus a fortiori to DPPs).

General distributions on $2^V$ with constrained support (case (2) above) typically arise upon incorporating prior knowledge or resource constraints. We focus on resource constraints such as bounds on cardinality and bounds on including limited items from sub-groups. Such constraints can be phrased as a family $\mathcal{C} \subseteq 2^V$ of subsets; we say $S$ satisfies the constraint iff $S \in \mathcal{C}$. Then the distribution of interest is of the form

$$\pi_{\mathcal{C}}(S) \propto \exp(\beta F(S))\,[\![ S \in \mathcal{C} ]\!], \tag{1.1}$$

where $F : 2^V \to \mathbb{R}$ is a set function that encodes relationships between items $i \in V$, $[\![\cdot]\!]$ is the Iverson bracket, and $\beta$ a constant (also referred to as the inverse temperature). Most prior work on sampling with combinatorial constraints (such as sampling the bases of a matroid) assumes that $F$ breaks up linearly using element-wise weights $w_i$, i.e., $F(S) = \sum_{i \in S} w_i$. In contrast, we allow generic, nonlinear functions, and obtain mixing times governed by structural properties of $F$.
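To make (1.1) concrete, the following minimal sketch enumerates $\pi_{\mathcal{C}}$ by brute force on a tiny ground set. The coverage function `coverage`, the `TOPICS` table, and the cardinality constraint are hypothetical choices made only for illustration; brute-force enumeration is feasible only for very small $N$.

```python
import itertools
import math

def constrained_distribution(ground_set, F, beta, in_constraint):
    """Enumerate pi_C(S) proportional to exp(beta * F(S)) * [S in C]."""
    items = list(ground_set)
    weights = {}
    for r in range(len(items) + 1):
        for combo in itertools.combinations(items, r):
            S = frozenset(combo)
            if in_constraint(S):              # Iverson bracket [S in C]
                weights[S] = math.exp(beta * F(S))
    Z_C = sum(weights.values())               # normalizer over feasible sets
    return {S: w / Z_C for S, w in weights.items()}

# Hypothetical nonlinear F: number of distinct "topics" covered by S.
TOPICS = {0: {"a", "b"}, 1: {"b"}, 2: {"c"}, 3: {"a", "c"}}

def coverage(S):
    covered = set()
    for i in S:
        covered |= TOPICS[i]
    return len(covered)

# Constraint family C = {S : |S| = 2} on the ground set {0, 1, 2, 3}.
pi_C = constrained_distribution(range(4), coverage, beta=1.0,
                                in_constraint=lambda S: len(S) == 2)
```

Because `coverage` is not a sum of per-item weights, sets of the same size receive different probabilities depending on how their items overlap, which is the nonlinear regime the text refers to.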

##### Contributions.

We briefly summarize the key contributions of this paper below.

• We derive a provably fast mixing Markov chain for efficient sampling from strongly Rayleigh measures (Theorem 2). This Markov chain is novel and may be of independent interest. Our results provide the first polynomial guarantee (to our knowledge) for Markov chain sampling from a general DPP, and more generally from an SR distribution. (Footnote 1: The analysis in [24] is not correct since it relies on a wrong construction of path coupling.)

• We analyze (Theorem 4) mixing times of an exchange chain when the constraint family is the set of bases of a special matroid, i.e., $|S| = k$ or $S$ obeys a partition constraint. Both of these constraints have high practical relevance [27, 25, 38].

• We analyze (Theorem 6) mixing times of an add-delete chain for the case $|S| \le k$, which, perhaps surprisingly, turns out to be quite different from $|S| = k$. This constraint can be more practical than the strict choice $|S| = k$, because in many applications the user may have an upper bound on the budget, but may not necessarily want to expend all $k$ units.

Finally, a detailed set of experiments illustrates our theoretical results.

##### Related work.

Recent work in machine learning addresses sampling from distributions with sub- or supermodular $F$ [19, 34], determinantal point processes [3, 29], and sampling by optimization [14, 31]. Many of these works (necessarily) make additional assumptions on $\pi_{\mathcal{C}}$, or are approximate, or cannot handle constraints. Moreover, the constraints cannot easily be included in $F$: an out-of-the-box application of the result in [19], for instance, would lead to an unbounded constant in the mixing time.

Apart from sampling, other related tracks include work on variational inference for combinatorial distributions [5, 11, 36, 38] and inference for submodular processes [21]. Special instances of (1.1) include [27], where the authors limit DPPs to sets that satisfy $|S| = k$; partition matroid constraints are studied in [25], while the budget constraint $|S| \le k$ has been used recently in learning DPPs [17]. Important existing results show fast mixing for a sub-family of strongly Rayleigh distributions [15, 3]; but those results do not include, for instance, general DPPs.

### 1.1 Background and Formal Setup

Before describing the details of our new contributions, let us briefly recall some useful background that also serves to set the notation. Our focus is on sampling from $\pi_{\mathcal{C}}$ in (1.1); we denote by $Z = \sum_{S \subseteq V} \exp(\beta F(S))$ and $Z_{\mathcal{C}} = \sum_{S \in \mathcal{C}} \exp(\beta F(S))$. The simplest example of $\pi_{\mathcal{C}}$ is the uniform distribution over sets in $\mathcal{C}$, where $F(S)$ is constant. In general, $F$ may be highly nonlinear.

We sample from $\pi_{\mathcal{C}}$ using MCMC, i.e., we run a Markov chain with state space $\mathcal{C}$. All our chains are ergodic. The mixing time of the chain indicates the number of iterations $t$ that we must perform (after starting from an arbitrary set $X_0 \in \mathcal{C}$) before we can consider $X_t$ as a valid sample from $\pi_{\mathcal{C}}$. Formally, if $\delta_{X_0}(t)$ is the total variation distance between the distribution of $X_t$ and $\pi_{\mathcal{C}}$ after $t$ steps, then $\tau_{X_0}(\varepsilon) = \min\{t : \delta_{X_0}(t') \le \varepsilon \ \forall t' \ge t\}$ is the mixing time to sample from a distribution $\varepsilon$-close to $\pi_{\mathcal{C}}$ in terms of total variation distance. We say that the chain mixes fast if $\tau_{X_0}$ is polynomial in $N$. The mixing time can be bounded in terms of the eigenvalues of the transition matrix, as the following classic result shows:

###### Theorem 1 (Mixing Time [10]).

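Let $\lambda_i$ be the eigenvalues of the transition matrix, and let $\lambda_{\max} < 1$ be the largest absolute value among the eigenvalues other than 1. Then, the mixing time starting from an initial set $X_0 \in \mathcal{C}$ is bounded as

$$\tau_{X_0}(\varepsilon) \le (1 - \lambda_{\max})^{-1}\left(\log \pi_{\mathcal{C}}(X_0)^{-1} + \log \varepsilon^{-1}\right).$$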

Most of the effort in bounding mixing times hence is devoted to bounding this eigenvalue.
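As a small numerical illustration of Theorem 1, the sketch below computes the exact total variation distance of a two-state lazy chain step by step and compares the resulting mixing time with the spectral bound. The two-state chain and the helper names are illustrative assumptions, chosen because its eigenvalues ($1$ and $1 - a - b$) are known in closed form; the sketch also relies on the standard fact that the distance to stationarity never increases with $t$, so the first time it drops below $\varepsilon$ is the mixing time.

```python
import math

def tv_distance(mu, pi):
    """Total variation distance between two distributions on the same states."""
    return 0.5 * sum(abs(m - p) for m, p in zip(mu, pi))

def step(mu, P):
    """Push a row distribution mu one step through transition matrix P."""
    n = len(P)
    return [sum(mu[i] * P[i][j] for i in range(n)) for j in range(n)]

def mixing_time(P, pi, start, eps, max_steps=10_000):
    """Smallest t at which the chain started in `start` is eps-close to pi."""
    mu = [0.0] * len(P)
    mu[start] = 1.0
    for t in range(max_steps + 1):
        if tv_distance(mu, pi) <= eps:
            return t
        mu = step(mu, P)
    raise RuntimeError("chain did not mix within max_steps")

def theorem1_bound(lambda_max, pi_x0, eps):
    """Right-hand side of the bound in Theorem 1."""
    return (math.log(1.0 / pi_x0) + math.log(1.0 / eps)) / (1.0 - lambda_max)

# Two-state chain: switch with probability a (from state 0) or b (from state 1).
a = b = 0.1
P = [[1 - a, a], [b, 1 - b]]
pi = [b / (a + b), a / (a + b)]          # stationary distribution
lambda_max = abs(1 - a - b)              # the only eigenvalue other than 1
t_mix = mixing_time(P, pi, start=0, eps=0.01)
bound = theorem1_bound(lambda_max, pi[0], eps=0.01)
```

In this toy case the exact mixing time is 18 steps while the bound evaluates to roughly 26.5, so the bound holds with some slack.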

## 2 Sampling from Strongly Rayleigh Distributions

In this section, we consider sampling from strongly Rayleigh (SR) distributions. Such distributions capture the strongest form of negative dependence properties, while enjoying a host of other remarkable properties [4]. For instance, they include the widely used DPPs as a special case. A distribution $\pi$ on $2^{[N]}$ is SR if its generating polynomial $p_\pi(z) = \sum_{S \subseteq [N]} \pi(S) \prod_{i \in S} z_i$ is real stable. This means if $\mathrm{Im}(z_i) > 0$ for all arguments $z_i$ of $p_\pi(z)$, then $p_\pi(z) \neq 0$.

We show in particular that SR distributions are amenable to efficient Markov chain sampling. Our starting point is the observation of [4] on closure properties of SR measures; of these we use symmetric homogenization. Given a distribution $\pi$ on $2^{[N]}$, its symmetric homogenization $\pi_{sh}$ on $2^{[2N]}$ is

$$\pi_{sh}(S) := \begin{cases} \pi(S \cap [N]) \binom{N}{|S \cap [N]|}^{-1} & \text{if } |S| = N; \\ 0 & \text{otherwise.} \end{cases}$$

If $\pi$ is SR, so is $\pi_{sh}$. We use this property below in our derivation of a fast-mixing chain.
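The definition of symmetric homogenization can be checked by brute force on a tiny example. In the sketch below, elements are 0-indexed, indices `N..2N-1` play the role of the added dummy elements, and the toy weights are arbitrary (they are not claimed to be SR); the helper names are illustrative. The check confirms that $\pi_{sh}$ is a probability distribution and that intersecting with the original ground set recovers $\pi$.

```python
import itertools
import math

def subsets(items):
    """All subsets of `items` as frozensets."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in itertools.combinations(items, r):
            yield frozenset(combo)

def symmetric_homogenization(pi, N):
    """pi_sh on size-N subsets of {0..2N-1}; indices >= N are dummy elements."""
    pi_sh = {}
    for combo in itertools.combinations(range(2 * N), N):
        T = frozenset(combo)
        S = frozenset(i for i in T if i < N)       # T restricted to the real items
        pi_sh[T] = pi.get(S, 0.0) / math.comb(N, len(S))
    return pi_sh

def project(pi_sh, N):
    """Distribution of T restricted to the real items when T ~ pi_sh."""
    out = {}
    for T, p in pi_sh.items():
        S = frozenset(i for i in T if i < N)
        out[S] = out.get(S, 0.0) + p
    return out

N = 3
raw = {S: 1.0 + len(S) for S in subsets(range(N))}  # arbitrary toy weights
Z = sum(raw.values())
pi = {S: w / Z for S, w in raw.items()}
pi_sh = symmetric_homogenization(pi, N)
pi_back = project(pi_sh, N)                          # should equal pi
```

The binomial factor is exactly the number of ways to pad a set of size $|S|$ with dummy elements up to size $N$, which is why the projection returns the original probabilities.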

We use here a recent result of Anari et al. [3], who show a Markov chain that mixes rapidly for homogeneous SR distributions. These distributions are over all subsets of some fixed size $k$, and hence do not include general DPPs. Concretely, for any $k$-homogeneous SR distribution $\pi$, a Gibbs-exchange sampler has mixing time

$$\tau_{X_0}(\varepsilon) \le 2k(N-k)\left(\log \pi(X_0)^{-1} + \log \varepsilon^{-1}\right).$$

This sampler uniformly samples one item in the current set, and one outside the current set, and swaps them with an appropriate probability. Using these ideas we show how to obtain fast mixing chains for any general SR distribution $\pi$ on $[N]$. First, we construct its symmetric homogenization $\pi_{sh}$, and sample from $\pi_{sh}$ using a Gibbs-exchange sampler. This chain is fast mixing, thus we will efficiently get a sample $T \sim \pi_{sh}$. The corresponding sample for $\pi$ can then be obtained by computing $S = T \cap [N]$. Theorem 2, proved in the appendix, formally establishes the validity of this idea.

###### Theorem 2.

If $\pi$ is SR, then the mixing time of a Gibbs-exchange sampler for $\pi_{sh}$ is bounded as

$$\tau_{X_0}(\varepsilon) \le 2N^2\left(\log \binom{N}{|X_0|} + \log(\pi(X_0))^{-1} + \log \varepsilon^{-1}\right). \tag{2.1}$$

For Theorem 2 we may choose the initial set $X_0$ so that the first term in the sum is logarithmic in $N$ (see the initialization in Algorithm 1).

Efficient Implementation. Directly running a chain to sample $N$ items from a (doubled) set of size $2N$ adds some computational overhead. Hence, we construct an equivalent, more space-efficient chain (Algorithm 1) on the initial ground set $V = [N]$ that only maintains $S \subseteq V$. Interestingly, this sampler is a mixture of add-delete and Gibbs-exchange samplers. This combination makes sense intuitively, too: add-delete moves (also shown in Alg. 3) are needed since the exchange sampler cannot change the cardinality of $S$. But a pure add-delete chain can stall if the sets concentrate around a fixed cardinality (low probability of a larger or smaller set). Exchange moves will not suffer the same high rejection rates. The key idea underlying Algorithm 1 is that the elements in $[2N] \setminus [N]$ are indistinguishable, so it suffices to maintain merely the cardinality of the currently selected subset instead of all its indices. Appendix C contains a detailed proof.
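The projection idea can be sketched as follows. This is a hypothetical reconstruction derived from the homogenization construction, not a transcription of Algorithm 1: it assumes a lazy exchange chain on the doubled set with a Gibbs-style acceptance probability $\pi_{sh}(T')/(\pi_{sh}(T) + \pi_{sh}(T'))$, and it tracks only $S$, since the number of dummy elements is determined by $|S|$. The function names and the unnormalized callable `pi` are assumptions for illustration.

```python
import itertools
import math
import random

def all_subsets(N):
    """All subsets of {0..N-1} as frozensets."""
    for r in range(N + 1):
        for combo in itertools.combinations(range(N), r):
            yield frozenset(combo)

def transition_probs(S, N, pi):
    """Exact one-step distribution of the projected chain from state S.

    `pi` is a callable returning a positive (possibly unnormalized) weight.
    """
    k = len(S)
    inside, outside = sorted(S), sorted(set(range(N)) - S)
    cur = pi(S) / math.comb(N, k)                 # homogenized weight of S
    probs = {}

    def propose(T, prob):
        new = pi(T) / math.comb(N, len(T))
        probs[T] = probs.get(T, 0.0) + prob * new / (cur + new)

    for s in inside:
        for t in outside:                         # real item out, real item in
            propose((S - {s}) | {t}, 0.5 / N**2)
        propose(S - {s}, 0.5 * k / N**2)          # real item out, dummy in
    for t in outside:
        propose(S | {t}, 0.5 * (N - k) / N**2)    # dummy out, real item in
    probs[S] = 1.0 - sum(probs.values())          # laziness and rejections
    return probs

def projected_step(S, N, pi, rng):
    """Sample the next state of the projected chain."""
    probs = transition_probs(S, N, pi)
    states = list(probs)
    return rng.choices(states, weights=[probs[T] for T in states])[0]

rng = random.Random(0)
state = projected_step(frozenset({0}), 3, lambda S: 1.0 + len(S), rng)
```

Under these assumptions the three move types (exchange, delete, add) appear with probabilities proportional to how many real and dummy elements sit inside and outside the doubled set, and the binomial factors in the acceptance ratio make the chain reversible with respect to $\pi$.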

###### Corollary 3.

The bound (2.1) applies to the mixing time of Algorithm 1.

Remarks. By assuming $\pi$ is SR, we obtain a clean bound for fast mixing. Compared to the bound in [19], our result avoids the somewhat opaque factor that depends on $F$.

In certain cases, the above chain may mix slower in practice than a pure add-delete chain that was used in previous works [24, 19], since its probability of doing nothing is higher. In other cases, it mixes much faster than the pure add-delete chain; we observe both phenomena in our experiments in Sec. 4. Contrary to a simple add-delete chain, in all cases, it is guaranteed to mix well.

## 3 Sampling from Matroid-Constrained Distributions

In this section we consider sampling from an explicitly-constrained distribution $\pi_{\mathcal{C}}$ where $\mathcal{C}$ specifies certain matroid base constraints (§3.1) or a uniform matroid of a given rank (§3.2).

### 3.1 Matroid Base Constraints

We begin with constraints that are special cases of matroid bases (Footnote 2: Drawing even a uniform sample from the bases of an arbitrary matroid can be hard.):

1. Uniform matroid: $\mathcal{C} = \{S \subseteq V : |S| = k\}$,

2. Partition matroid: Given a partition $V = \bigcup_{i=1}^{k} P_i$, we allow sets that contain exactly one element from each $P_i$: $\mathcal{C} = \{S \subseteq V : |S \cap P_i| = 1 \text{ for all } 1 \le i \le k\}$.

An important special case of a distribution with a uniform matroid constraint is the $k$-DPP [27]. Partition matroids are used in multilabel problems [38], and also in probabilistic diversity models [21].

The sampler is shown in Algorithm 2. At each iteration, we randomly select an item $s \in S$ and $t \in V \setminus S$ such that the new set $S \cup \{t\} \setminus \{s\}$ satisfies $\mathcal{C}$, and swap them with certain probability. For uniform matroids, this means $t \in V \setminus S$; for partition matroids, $t \in P(s) \setminus \{s\}$ where $P(s)$ is the part that $s$ resides in. The fact that the chain has stationary distribution $\pi_{\mathcal{C}}$ can be inferred via detailed balance. Similar to the analysis in [19] for unconstrained sampling, the mixing time depends on a quantity that measures how much $F$ deviates from linearity: $\zeta_F = \max_{S, T \in \mathcal{C}} |F(S) + F(T) - F(S \cap T) - F(S \cup T)|$. Our proof, however, differs from that of [19]. While they use canonical paths [10], we use multicommodity flows, which are more effective in our constrained setting.
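The exchange move for a partition matroid can be sketched as below. This is a minimal illustration under stated assumptions, not Algorithm 2 itself: the laziness of 1/2 and the Gibbs-style acceptance probability $\pi_{\mathcal{C}}(T)/(\pi_{\mathcal{C}}(S) + \pi_{\mathcal{C}}(T))$ are inferred from the form of $Q(e)$ used in the proofs, and the parts and the function `F` are hypothetical.

```python
import math
import random

def accept_prob(F, beta, S, T):
    """Gibbs-style acceptance pi(T) / (pi(S) + pi(T)), written as a logistic."""
    return 1.0 / (1.0 + math.exp(beta * (F(S) - F(T))))

def exchange_step(S, parts, F, beta, rng):
    """One lazy exchange move that keeps exactly one element per part."""
    if rng.random() < 0.5:                      # lazy: stay put half of the time
        return S
    s = rng.choice(sorted(S))                   # element to swap out
    part = next(P for P in parts if s in P)     # the part P(s) containing s
    candidates = sorted(part - {s})             # swap-in must come from P(s)
    if not candidates:
        return S
    t = rng.choice(candidates)
    T = (S - {s}) | {t}
    return T if rng.random() < accept_prob(F, beta, S, T) else S

# Hypothetical example: three parts, one element chosen from each.
parts = [frozenset({0, 1, 2}), frozenset({3, 4}), frozenset({5, 6, 7})]
F = lambda S: 0.1 * sum(S)                      # toy set function
rng = random.Random(0)
S = frozenset({0, 3, 5})
for _ in range(1000):
    S = exchange_step(S, parts, F, beta=1.0, rng=rng)
```

Because the proposal picks $s$ uniformly from $S$ and $t$ uniformly from the rest of the same part, the proposal probabilities are symmetric, and the acceptance rule then gives detailed balance with respect to $\pi_{\mathcal{C}}$.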

###### Theorem 4.

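Consider the chain in Algorithm 2. For the uniform matroid, $\tau_{X_0}(\varepsilon)$ is bounded as

$$\tau_{X_0}(\varepsilon) \le 4k(N-k)\exp(2\beta\zeta_F)\left(\log \pi_{\mathcal{C}}(X_0)^{-1} + \log \varepsilon^{-1}\right); \tag{3.1}$$

For the partition matroid, the mixing time is bounded as

$$\tau_{X_0}(\varepsilon) \le 4k^2 \max_i |P_i| \exp(2\beta\zeta_F)\left(\log \pi_{\mathcal{C}}(X_0)^{-1} + \log \varepsilon^{-1}\right). \tag{3.2}$$

Observe that if the $P_i$'s form an equipartition, i.e., $|P_i| = N/k$ for all $i$, then the second bound becomes $\widetilde{O}(kN)$. For $k = O(\log N)$, the mixing times depend as $O(N \,\mathrm{polylog}(N))$ on $N$. For uniform matroids, the time is equally small if $k$ is close to $N$. Finally, the time depends on the initialization, $\pi_{\mathcal{C}}(X_0)$. If $F$ is monotone increasing, one may run a simple greedy algorithm to ensure that $\pi_{\mathcal{C}}(X_0)$ is large. If $F$ is monotone submodular, this ensures that $\log \pi_{\mathcal{C}}(X_0)^{-1} = O(\log N)$.

Our proof uses a multicommodity flow to upper bound $\lambda_{\max}$ of the transition matrix. Concretely, let $\mathcal{H}$ be the set of all simple paths between states in the state graph of the Markov chain; we construct a flow $f : \mathcal{H} \to \mathbb{R}^{+}$ that assigns a nonnegative flow value to any simple path between any two states (sets) $X, Y \in \mathcal{C}$. Each edge $e = (S, T)$ in the graph has a capacity $Q(e) = \pi_{\mathcal{C}}(S) P(S, T)$ where $P(S, T)$ is the transition probability from $S$ to $T$. The total flow sent from $X$ to $Y$ must be $\pi_{\mathcal{C}}(X)\pi_{\mathcal{C}}(Y)$: if $\mathcal{H}_{XY}$ is the set of all simple paths from $X$ to $Y$, then we need $\sum_{p \in \mathcal{H}_{XY}} f(p) = \pi_{\mathcal{C}}(X)\pi_{\mathcal{C}}(Y)$. Intuitively, the mixing time relates to the congestion in any edge, and the length of the paths. If there are many short paths across which flow can be distributed, then mixing is fast. This intuition is captured in a fundamental theorem: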

###### Theorem 5 (Multicommodity Flow [35]).

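Let $E$ be the set of edges in the transition graph, and $P(X, Y)$ the transition probability. Define

$$\bar{\rho}(f) = \max_{e \in E} \frac{1}{Q(e)} \sum_{p \ni e} f(p)\,\mathrm{len}(p),$$

where $\mathrm{len}(p)$ is the length of the path $p$. Then $\lambda_{\max} \le 1 - 1/\bar{\rho}(f)$.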

With this property of multicommodity flow, we are ready to prove Thm. 4.

###### Proof.

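(Theorem 4) We sketch the proof for partition matroids; the full proof is in Appendix A. For any two sets $X, Y \in \mathcal{C}$, we distribute the flow equally across all shortest paths from $X$ to $Y$ in the transition graph and bound the amount of flow through any edge $e \in E$.

Consider two arbitrary sets $X, Y \in \mathcal{C}$ with symmetric difference $|X \oplus Y| = 2m$, i.e., $m$ elements need to be exchanged to reach from $X$ to $Y$. However, these $m$ steps are a valid path in the transition graph only if every set along the way is in $\mathcal{C}$. The exchange property of matroids implies that this requirement is indeed true, so any shortest path has length $m$. Moreover, there are exactly $m!$ such paths, since we can exchange the elements in $X \setminus Y$ in any order to arrive at $Y$. Note that once we choose $s \in X \setminus Y$ to swap out, there is only one choice $t \in Y \setminus X$ to swap in, where $t$ lies in the same part as $s$ in the partition matroid; otherwise the constraint would be violated. Since the total flow is $\pi_{\mathcal{C}}(X)\pi_{\mathcal{C}}(Y)$, each path receives $\pi_{\mathcal{C}}(X)\pi_{\mathcal{C}}(Y)/m!$ flow.

Next, let $e = (S, T)$ be any edge on some shortest path from $X$ to $Y$; so $S, T \in \mathcal{C}$ and $T = S \cup \{j\} \setminus \{i\}$ for some $i, j \in V$. Let $r = |X \oplus S|/2 < m$ be the length of the shortest path from $X$ to $S$, i.e., $r$ elements need to be exchanged to reach from $X$ to $S$. Similarly, $m - r - 1$ elements are exchanged to reach from $T$ to $Y$. Since there is a path for every permutation of those elements, the ratio of the total flow $w_e(X, Y)$ that edge $e$ receives from pair $(X, Y)$, and $Q(e)$, becomes

$$\frac{w_e(X, Y)}{Q(e)} \le \frac{2\, r!\,(m-1-r)!\, kL}{m!\, Z_{\mathcal{C}}} \exp(2\beta\zeta_F)\big(\exp(\beta F(\sigma_S(X, Y))) + \exp(\beta F(\sigma_T(X, Y)))\big), \tag{3.3}$$

where we define $\sigma_S(X, Y) = X \oplus Y \oplus S$ and $L = \max_i |P_i|$. To bound the total flow, we must count the pairs $(X, Y)$ such that $e$ is on their shortest path(s), and bound the flow they send. We do this in two steps, first summing over all $(X, Y)$'s that share the upper bound (3.3) since they have the same difference sets $U_S = \sigma_S(X, Y)$ and $U_T = \sigma_T(X, Y)$, and then we sum over all possible $U_S$ and $U_T$. For fixed $U_S$, $U_T$, there are $\binom{m-1}{r}$ pairs that share those difference sets, since the only freedom we have is to assign $r$ of the $m - 1$ elements in $S \setminus (X \cap Y)$ other than $i$ to $Y$, and the rest to $X$. Appropriate summing and canceling then yields, for fixed $U_S$, $U_T$,

$$\sum_{(X, Y):\, \sigma_S(X, Y) = U_S,\ \sigma_T(X, Y) = U_T} \frac{w_e(X, Y)}{Q(e)} \le \frac{2kL}{Z_{\mathcal{C}}} \exp(2\beta\zeta_F)\big(\exp(\beta F(U_S)) + \exp(\beta F(U_T))\big). \tag{3.4}$$

Finally, we sum over all valid $U_T$ ($U_S$ is determined by $U_T$). One can show that any valid $U_T \in \mathcal{C}$, and hence $\sum_{U_T} \exp(\beta F(U_T)) \le Z_{\mathcal{C}}$, and likewise for $U_S$. Hence, summing the bound (3.4) over all possible choices of $U_T$ yields

$$\bar{\rho}(f) \le 4kL \exp(2\beta\zeta_F) \max_p \mathrm{len}(p) \le 4k^2 L \exp(2\beta\zeta_F),$$

where we upper bound the length of any shortest path by $k$, since $m \le k$. Hence

$$\tau_{X_0}(\varepsilon) \le 4k^2 L \exp(2\beta\zeta_F)\left(\log \pi_{\mathcal{C}}(X_0)^{-1} + \log \varepsilon^{-1}\right). \qquad \blacksquare$$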

For more restrictive constraints, there are fewer paths, and the bounds can become larger. Appendix A.3 treats the general matroid base case. It is also interesting to compare the bound on the uniform matroid in Eq. (3.1) to that shown in [3] for a sub-class of distributions that satisfy the property of being homogeneous strongly Rayleigh (Footnote 3: Appendix C contains details about strongly Rayleigh distributions.). If $\pi_{\mathcal{C}}$ is homogeneous strongly Rayleigh, we have $\tau_{X_0}(\varepsilon) \le 2k(N-k)(\log \pi_{\mathcal{C}}(X_0)^{-1} + \log \varepsilon^{-1})$. In our analysis, without additional assumptions on $F$, we pay a factor of $2\exp(2\beta\zeta_F)$ for generality. The exponential term is one for some strongly Rayleigh distributions (e.g., if $F$ is modular), but not for all.

### 3.2 Uniform Matroid Constraint

We consider the constraint that $\mathcal{C}$ is a uniform matroid of a certain rank: $\mathcal{C} = \{S \subseteq V : |S| \le k\}$. We employ the lazy add-delete Markov chain in Algorithm 3, where in each iteration, with probability 0.5 we uniformly randomly sample one element from $V$ and either add it to or delete it from the current set, while respecting constraints. To show fast mixing, we use path coupling [8], which essentially says that if two coupled chains contract in expectation, then the chain mixes fast. We construct the path coupling on a carefully generated graph with edges $E$ (from a proper metric). With all details in Appendix B, we end up with the following theorem:

###### Theorem 6.

Consider the chain shown in Algorithm 3. Let $\alpha = \max_{(S, T) \in E} \{\alpha_1, \alpha_2\}$ where $\alpha_1$ and $\alpha_2$ are functions of edges $(S, T) \in E$ and are defined as

 α1= ∑i∈T|p−(T,i)−p−(S,i)|++⟦|S|

where $(x)_+ = \max(0, x)$. The summations over absolute differences quantify the sensitivity of transition probabilities to adding/deleting elements in neighboring $(S, T)$. Assuming $\alpha < 1$, we get

$$\tau(\varepsilon) \le \frac{2N \log(N\varepsilon^{-1})}{1 - \alpha}.$$

Remarks. If $\alpha$ is less than 1 and independent of $N$, then the mixing time is nearly linear in $N$. The condition is conceptually similar to those in [34, 29]. Fast mixing requires both $\alpha_1$ and $\alpha_2$, specifically, the change in probability when adding or deleting a single element to neighboring subsets, to be small. Such a notion is closely related to the curvature of discrete set functions.
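The lazy add-delete chain of this section can be sketched as follows. The description in the text fixes the laziness (probability 0.5) and the uniform choice of an element; the Gibbs-style acceptance rule used here is an assumption made for illustration, since the exact probabilities of Algorithm 3 are not reproduced in this section, and the helper names are hypothetical.

```python
import math
import random

def add_delete_step(S, N, k, F, beta, rng):
    """One lazy add-delete move on subsets of {0..N-1} with |S| <= k."""
    if rng.random() < 0.5:              # lazy: stay put half of the time
        return S
    i = rng.randrange(N)                # uniformly chosen element of V
    if i in S:
        T = S - {i}                     # propose deleting i
    elif len(S) < k:
        T = S | {i}                     # propose adding i (budget allows it)
    else:
        return S                        # adding would violate |S| <= k
    # Assumed acceptance rule: pi(T) / (pi(S) + pi(T)).
    accept = 1.0 / (1.0 + math.exp(beta * (F(S) - F(T))))
    return T if rng.random() < accept else S

def run_chain(steps, N, k, F, beta, seed=0):
    """Run the chain from the empty set and return the visited states."""
    rng = random.Random(seed)
    S = frozenset()
    trace = []
    for _ in range(steps):
        S = add_delete_step(S, N, k, F, beta, rng)
        trace.append(S)
    return trace

# Toy run: F depends only on the cardinality of S (hypothetical choice).
trace = run_chain(2000, N=4, k=2, F=lambda S: math.sqrt(len(S)), beta=1.0)
```

With this acceptance rule the proposal is symmetric (each element is picked with probability $1/N$ in both directions), so the chain is reversible with respect to $\pi_{\mathcal{C}}$ restricted to sets of size at most $k$.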

## 4 Experiments

We next empirically study the dependence of sampling times on key factors that govern our theoretical bounds. In particular, we run Markov chains on chain-structured Ising models on a partition matroid base and DPPs on a uniform matroid, and consider estimating marginal and conditional probabilities of a single variable. To monitor the convergence of Markov chains, we use the potential scale reduction factor (PSRF) [18, 7], which runs several chains in parallel and compares within-chain variances to between-chain variances. Typically, PSRF is greater than 1 and will converge to 1 in the limit; if it is close to 1 we empirically conclude that chains have mixed well. Throughout experiments we run 10 chains in parallel for estimations, and declare “convergence” at a PSRF of 1.05.
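For reference, a basic form of the PSRF can be computed as in the sketch below. This is the simple Gelman-Rubin ratio without the degrees-of-freedom correction, which may differ in detail from the variant used in the experiments; each chain is assumed to be a list of scalar statistics of equal length, for example the 0/1 indicator of whether a fixed item is in the current set.

```python
import statistics

def psrf(chains):
    """Basic potential scale reduction factor for equally long scalar chains."""
    n = len(chains[0])
    means = [statistics.fmean(c) for c in chains]
    # W: average within-chain sample variance.
    W = statistics.fmean(statistics.variance(c) for c in chains)
    # B / n: sample variance of the per-chain means.
    B_over_n = statistics.variance(means)
    var_hat = (n - 1) / n * W + B_over_n      # pooled variance estimate
    return (var_hat / W) ** 0.5

# Two chains that agree versus two chains stuck in different regions.
agree = psrf([[0, 1, 0, 1], [0, 1, 0, 1]])
disagree = psrf([[0, 1, 0, 1], [10, 11, 10, 11]])
```

Chains that explore the same region give a value near 1 (slightly below 1 for very short chains, since the factor $(n-1)/n$ is then noticeable), while chains stuck in different regions give a value far above the 1.05 threshold.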

We first focus on small synthetic examples where we can compute exact marginal and conditional probabilities. We construct a 20-variable chain-structured Ising model as

$$\pi_{\mathcal{C}}(S) \propto \exp\left(\beta\left(\left(\delta \sum_{i=1}^{19} w_i (s_i \oplus s_{i+1})\right) + (1 - \delta)|S|\right)\right)[\![ S \in \mathcal{C} ]\!],$$

where the $s_i$ are 0-1 encodings of $S$, and the $w_i$ are drawn uniformly at random. The parameters $(\beta, \delta)$ govern bounds on the mixing time via $\exp(2\beta\zeta_F)$; the smaller $\delta$, the smaller $\zeta_F$.

$\mathcal{C}$ is a partition matroid of rank 5. We estimate conditional probabilities of one random variable conditioned on 0, 1 and 2 other variables and compare against the ground truth. We consider three settings of $(\beta, \delta)$; the results are shown in Fig. 1. All marginals and conditionals converge to their true values, but at different speeds. Comparing Fig. 1(a) against 1(b), we observe that with fixed $\delta$, an increase in $\beta$ slows down the convergence, as expected. Comparing Fig. 1(b) against 1(c), we observe that with fixed $\beta$, a decrease in $\delta$ speeds up the convergence, also as expected given our theoretical results. Appendix D.1 and D.2 illustrate the convergence of estimations under other settings.

We also check convergence on larger models. We use a DPP on a uniform matroid of rank 30 on the Ailerons data (http://www.dcc.fc.up.pt/~ltorgo/Regression/DataSets.html) of size 200. Here, we do not have access to the ground truth, and hence plot the estimation mean with standard deviations among 10 chains in Fig. 3(a). We observe that the chains eventually converge, i.e., the mean becomes stable and the variance small. We also use PSRF to approximately judge the convergence. More results can be found in Appendix D.3.

Furthermore, the mixing time depends on the size $N$ of the ground set. We use a DPP on Ailerons and vary $N$ from 50 to 1000. Fig. 2(a) shows the PSRF from 10 chains for each setting. By thresholding PSRF at 1.05 in Fig. 2(b) we see a clearer dependence on $N$. At this scale, the mixing time grows almost linearly with $N$, indicating that this chain is efficient at least at small to medium scale.

Finally, we empirically study how fast our sampler on strongly Rayleigh distributions converges. We compare the chain in Algorithm 1 (Mix) against a simple add-delete chain (Add-Delete). We use a DPP on Ailerons data of size 200, and the corresponding PSRF is shown in Fig. 3(b). We observe that Mix converges slightly slower than Add-Delete since it is lazier. However, the Add-Delete chain does not always mix fast. Fig. 3(c) illustrates a different setting, where we modify the eigenspectrum of the kernel matrix: the first 100 eigenvalues are 500 and the others 1/500. Such a kernel corresponds to almost an elementary DPP, where the size of the observed subsets sharply concentrates around 100. Here, Add-Delete moves very slowly. Mix, in contrast, can exchange elements and thus converges much faster than Add-Delete.
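The concentration of the subset size in the modified-spectrum setting can be verified with a short calculation. The sketch assumes the stated eigenvalues are those of the L-ensemble kernel of the DPP; in that case the cardinality of a sample is distributed as a sum of independent Bernoulli variables with success probabilities $\lambda_i/(1+\lambda_i)$, a standard property of DPPs. The function name is illustrative.

```python
def dpp_size_mean_var(eigenvalues):
    """Mean and variance of |S| for an L-ensemble DPP with the given spectrum."""
    probs = [lam / (1.0 + lam) for lam in eigenvalues]   # Bernoulli parameters
    mean = sum(probs)
    var = sum(p * (1.0 - p) for p in probs)
    return mean, var

# 200 items: the first 100 eigenvalues are 500, the remaining 100 are 1/500.
eigs = [500.0] * 100 + [1.0 / 500.0] * 100
mean, var = dpp_size_mean_var(eigs)
```

Under this assumption the expected size is 100 and the variance is about 0.4, so nearly all of the probability mass sits on sets of size 99 to 101. A chain that can only add or delete must pass through sets of unlikely sizes to change its content, which is consistent with the slow movement of Add-Delete reported above.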

## 5 Discussion and Open Problems

We presented theoretical results on Markov chain sampling for discrete probabilistic models subject to implicit and explicit constraints. In particular, under an implicit constraint that the probability measure is strongly Rayleigh, we obtain an unconditional fast mixing guarantee. For distributions with various explicit constraints we showed sufficient conditions for fast mixing. We show empirically that the dependencies of mixing times on various factors are consistent with our theoretical analysis.

There still exist many open problems in both implicitly- and explicitly-constrained settings. Many bounds that we show depend on structural quantities ($\zeta_F$ or $\alpha$) that may not always be easy to quantify in practice. It will be valuable to develop chains on special classes of distributions (as we did for strongly Rayleigh) whose mixing time is independent of these factors. Moreover, we only considered matroid bases or uniform matroids, while several important settings such as knapsack constraints remain open. In fact, even uniform sampling with a knapsack constraint is not easy; a mixing time of $O(N^{9/2+\epsilon})$ is known [33]. We defer the development of similar or better bounds, potentially with structural factors like $\exp(\beta\zeta_F)$, on specialized discrete probabilistic models to future work.

Acknowledgements. This research was partially supported by NSF CAREER 1553284 and a Google Research Award. We thank Ruilin Li for pointing out typos.

## References

• Aldous [1982] D. J. Aldous. Some inequalities for reversible Markov chains. Journal of the London Mathematical Society, pages 564–576, 1982.
• Anari and Gharan [2015] N. Anari and S. O. Gharan. Effective-resistance-reducing flows and asymmetric tsp. In FOCS, 2015.
• Anari et al. [2016] N. Anari, S. O. Gharan, and A. Rezaei. Monte Carlo Markov chain algorithms for sampling strongly Rayleigh distributions and determinantal point processes. In COLT, 2016.
• Borcea et al. [2009] J. Borcea, P. Brändén, and T. Liggett. Negative dependence and the geometry of polynomials. Journal of the American Mathematical Society, pages 521–567, 2009.
• Bouchard-Côté and Jordan [2010] A. Bouchard-Côté and M. I. Jordan. Variational inference over combinatorial spaces. In NIPS, 2010.
• Broder [1989] A. Broder. Generating random spanning trees. In FOCS, pages 442–447, 1989.
• Brooks and Gelman [1998] S. P. Brooks and A. Gelman. General methods for monitoring convergence of iterative simulations. Journal of computational and graphical statistics, pages 434–455, 1998.
• Bubley and Dyer [1997] R. Bubley and M. Dyer. Path coupling: A technique for proving rapid mixing in Markov chains. In FOCS, pages 223–231, 1997.
• Cesa-Bianchi and Lugosi [2009] N. Cesa-Bianchi and G. Lugosi. Combinatorial bandits. In COLT, 2009.
• Diaconis and Stroock [1991] P. Diaconis and D. Stroock. Geometric bounds for eigenvalues of Markov chains. The Annals of Applied Probability, pages 36–61, 1991.
• Djolonga and Krause [2014] J. Djolonga and A. Krause. From MAP to marginals: Variational inference in bayesian submodular models. In NIPS, pages 244–252, 2014.
• Dyer and Greenhill [1998] M. Dyer and C. Greenhill. A more rapidly mixing Markov chain for graph colorings. Random Structures and Algorithms, pages 285–317, 1998.
• Dyer et al. [1999] M. Dyer, A. Frieze, and M. Jerrum. On counting independent sets in sparse graphs. In FOCS, 1999.
• Ermon et al. [2013] S. Ermon, C. P. Gomes, A. Sabharwal, and B. Selman. Embed and project: Discrete sampling with universal hashing. In NIPS, pages 2085–2093, 2013.
• Feder and Mihail [1992] T. Feder and M. Mihail. Balanced matroids. In STOC, pages 26–38, 1992.
• Frieze et al. [2014] A. Frieze, N. Goyal, L. Rademacher, and S. Vempala. Expanders via random spanning trees. SIAM Journal on Computing, 43(2):497–513, 2014.
• Gartrell et al. [2016] M. Gartrell, U. Paquet, and N. Koenigstein. Low-rank factorization of determinantal point processes for recommendation. arXiv:1602.05436, 2016.
• Gelman and Rubin [1992] A. Gelman and D. B. Rubin. Inference from iterative simulation using multiple sequences. Statistical science, pages 457–472, 1992.
• Gotovos et al. [2015] A. Gotovos, H. Hassani, and A. Krause. Sampling from probabilistic submodular models. In NIPS, 2015.
• Greig et al. [1989] D. M. Greig, B. T. Porteous, and A. H. Seheult. Exact maximum a posteriori estimation for binary images. Journal of the Royal Statistical Society, 1989.
• Iyer and Bilmes [2015] R. Iyer and J. Bilmes. Submodular point processes. In AISTATS, 2015.
• Jerrum and Sinclair [1993] M. Jerrum and A. Sinclair. Polynomial-time approximation algorithms for the Ising model. SIAM J. Computing, 1993.
• Jerrum et al. [2004] M. Jerrum, A. Sinclair, and E. Vigoda. A polynomial-time approximation algorithm for the permanent of a matrix with nonnegative entries. JACM, 2004.
• Kang [2013] B. Kang. Fast determinantal point process sampling with application to clustering. In NIPS, pages 2319–2327, 2013.
• Kathuria and Deshpande [2016] T. Kathuria and A. Deshpande. On sampling from constrained diversity promoting point processes. 2016.
• Kojima and Komaki [2014] M. Kojima and F. Komaki. Determinantal point process priors for Bayesian variable selection in linear regression. arXiv:1406.2100, 2014.
• Kulesza and Taskar [2011] A. Kulesza and B. Taskar. k-DPPs: Fixed-size determinantal point processes. In ICML, pages 1193–1200, 2011.
• Kulesza and Taskar [2012] A. Kulesza and B. Taskar. Determinantal point processes for machine learning. arXiv preprint arXiv:1207.6083, 2012.
• Li et al. [2016a] C. Li, S. Jegelka, and S. Sra. Fast DPP sampling for Nyström with application to kernel methods. In ICML, 2016a.
• Li et al. [2016b] C. Li, S. Sra, and S. Jegelka. Gaussian quadrature for matrix inverse forms with applications. In ICML, 2016b.
• Maddison et al. [2014] C. J. Maddison, D. Tarlow, and T. Minka. A* sampling. In NIPS, 2014.
• Mariet and Sra [2016] Z. Mariet and S. Sra. Diversity networks. In ICLR, 2016.
• Morris and Sinclair [2004] B. Morris and A. Sinclair. Random walks on truncated cubes and sampling 0-1 knapsack solutions. SIAM journal on computing, pages 195–226, 2004.
• Rebeschini and Karbasi [2015] P. Rebeschini and A. Karbasi. Fast mixing for discrete point processes. In COLT, 2015.
• Sinclair [1992] A. Sinclair. Improved bounds for mixing rates of Markov chains and multicommodity flow. Combinatorics, probability and Computing, pages 351–370, 1992.
• Smith and Eisner [2008] D. Smith and J. Eisner. Dependency parsing by belief propagation. In EMNLP, 2008.
• Spielman and Srivastava [2008] D. Spielman and N. Srivastava. Graph sparsification by effective resistances. In STOC, 2008.
• Zhang et al. [2015] J. Zhang, J. Djolonga, and A. Krause. Higher-order inference for multi-class log-supermodular models. In ICCV, pages 1859–1867, 2015.

## Appendix A Proof of Thm. 4

### A.1 Proof for Uniform Matroid Base

###### Proof.

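We consider the case where $\mathcal{C}$ is a uniform matroid base. For any two sets $X, Y \in \mathcal{C}$, we distribute the flow equally across all shortest paths from $X$ to $Y$ in the transition graph. Then, for an arbitrary edge $e \in E$, we bound the number of paths (and flow) through $e$.

Consider two arbitrary sets $X, Y \in \mathcal{C}$ with symmetric difference $|X \oplus Y| = 2m$. Any shortest path has length $m$. Moreover, there are exactly $(m!)^2$ such paths, since we can exchange the elements in $X \setminus Y$ in any order with the elements in $Y \setminus X$ in any order to arrive at $Y$. Since the total flow is $\pi_{\mathcal{C}}(X)\pi_{\mathcal{C}}(Y)$, each path receives $\pi_{\mathcal{C}}(X)\pi_{\mathcal{C}}(Y)/(m!)^2$ flow.

Next, let $e = (S, T)$ be any edge on some shortest path from $X$ to $Y$; so $S, T \in \mathcal{C}$ and $T = S \cup \{j\} \setminus \{i\}$ for some $i, j \in V$. Let $r = |X \oplus S|/2 < m$ be the length of the shortest path from $X$ to $S$; thus there are $(r!)^2$ ways to reach from $X$ to $S$. Similarly, $m - r - 1$ elements are exchanged to reach from $T$ to $Y$ and there are in total $((m-r-1)!)^2$ ways to do so. The total flow $e$ receives from pair $(X, Y)$ is

$$w_e(X, Y) = \frac{\pi_{\mathcal{C}}(X)\pi_{\mathcal{C}}(Y)}{(m!)^2}\,(r!)^2\,((m-1-r)!)^2.$$

Since in our chain,

$$Q(e) = \frac{\exp(\beta F(S))\exp(\beta F(T))}{2 Z_{\mathcal{C}}\, k(N-k)\big(\exp(\beta F(S)) + \exp(\beta F(T))\big)},$$

it follows that

$$\frac{w_e(X, Y)}{Q(e)} = \frac{2 (r!)^2 ((m-1-r)!)^2\, k(N-k) \exp(\beta(F(X) + F(Y)))\big(\exp(\beta F(S)) + \exp(\beta F(T))\big)}{(m!)^2 Z_{\mathcal{C}} \exp(\beta(F(S) + F(T)))}$$

$$\le \frac{2 (r!)^2 ((m-1-r)!)^2\, k(N-k)}{(m!)^2 Z_{\mathcal{C}}} \exp(2\beta\zeta_F)\big(\exp(\beta F(\sigma_S(X, Y))) + \exp(\beta F(\sigma_T(X, Y)))\big),$$

where we define $\sigma_S(X, Y) = X \oplus Y \oplus S$. The inequality draws from the fact that

$$\frac{\exp(\beta(F(X) + F(Y) + F(S)))}{\exp(\beta(F(S) + F(T)))} = \exp(\beta(F(X) + F(Y) - F(T)))$$

$$= \exp(\beta(F(X) + F(Y) - F(X \cap Y) - F(X \cup Y)))\, \exp(\beta(F(X \cap Y) + F(X \cup Y) - F(T) - F(\sigma_T(X, Y))))\, \exp(\beta F(\sigma_T(X, Y)))$$

$$\le \exp(2\beta\zeta_F) \exp(\beta F(\sigma_T(X, Y)))$$

and likewise for the term involving $\sigma_S(X, Y)$. A similar trick has been used in [19].

Let $U_S = \sigma_S(X, Y)$ and $U_T = \sigma_T(X, Y)$; then for fixed $(U_S, U_T)$, the total flow that passes $e$ is

$$\sum_{(X, Y):\, \sigma_S(X, Y) = U_S,\ \sigma_T(X, Y) = U_T} \frac{w_e(X, Y)}{Q(e)} \le 2 \sum_{r=0}^{m-1} \binom{m-1}{r}^2 \frac{(r!)^2 ((m-1-r)!)^2\, k(N-k)}{(m!)^2 Z_{\mathcal{C}}} \exp(2\beta\zeta_F)\big(\exp(\beta F(U_S)) + \exp(\beta F(U_T))\big)$$

$$= \frac{2k(N-k)}{m Z_{\mathcal{C}}} \exp(2\beta\zeta_F)\big(\exp(\beta F(U_S)) + \exp(\beta F(U_T))\big).$$

Finally, with the definition of $\bar{\rho}(f)$ we sum over all images of $U_S$ and $U_T$. Recall that $\mathrm{len}(p) = m$ for the shortest paths considered, which cancels the factor $1/m$. Since we know that $U_S, U_T \in \mathcal{C}$, we have $\sum_{U_S} \exp(\beta F(U_S)) \le Z_{\mathcal{C}}$ and $\sum_{U_T} \exp(\beta F(U_T)) \le Z_{\mathcal{C}}$, and thus

$$\bar{\rho}(f) \le 4k(N-k)\exp(2\beta\zeta_F).$$

Hence

$$\tau_{X_0}(\varepsilon) \le 4k(N-k)\exp(2\beta\zeta_F)\left(\log \pi_{\mathcal{C}}(X_0)^{-1} + \log \varepsilon^{-1}\right).$$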

### A.2 Proof on Partition Matroid Base

###### Proof.

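Consider two arbitrary sets $X, Y \in \mathcal{C}$ with symmetric difference $|X \oplus Y| = 2m$, i.e., $m$ elements need to be exchanged to reach from $X$ to $Y$. However, these $m$ steps are a valid path in the transition graph only if every set along the way is in $\mathcal{C}$. The exchange property of matroids implies that this is indeed true, so any shortest path has length $m$. Moreover, there are exactly $m!$ such paths, since we can exchange the elements in $X \setminus Y$ in any order to arrive at $Y$. Note that once we choose $s \in X \setminus Y$ to swap out, there is only one choice $t \in Y \setminus X$ to swap in, where $t$ lies in the same part as $s$ in the partition matroid; otherwise the constraint would be violated. Since the total flow is $\pi_{\mathcal{C}}(X)\pi_{\mathcal{C}}(Y)$, each path receives $\pi_{\mathcal{C}}(X)\pi_{\mathcal{C}}(Y)/m!$ flow.

Next, let $e = (S, T)$ be any edge on some shortest path from $X$ to $Y$; so $S, T \in \mathcal{C}$ and $T = S \cup \{j\} \setminus \{i\}$ for some $i, j \in V$. Let $r = |X \oplus S|/2 < m$ be the length of the shortest path from $X$ to $S$, i.e., $r$ elements need to be exchanged to reach from $X$ to $S$. Similarly, $m - r - 1$ elements are exchanged to reach from $T$ to $Y$. Since there is a path for every permutation of those elements, the total flow edge $e$ receives from pair $(X, Y)$ is

$$w_e(X, Y) = \frac{\pi_{\mathcal{C}}(X)\pi_{\mathcal{C}}(Y)}{m!}\, r!\,(m-1-r)!.$$

Since, in our chain (using $L = \max_i |P_i|$ as an upper bound on the number of swap-in candidates within a part),

$$Q(e) \ge \frac{\pi_{\mathcal{C}}(S)}{2kL} \cdot \frac{\pi_{\mathcal{C}}(T)}{\pi_{\mathcal{C}}(S) + \pi_{\mathcal{C}}(T)} = \frac{\exp(\beta F(S))\exp(\beta F(T))}{2kL\, Z_{\mathcal{C}}\big(\exp(\beta F(S)) + \exp(\beta F(T))\big)},$$

it follows that

$$\frac{w_e(X, Y)}{Q(e)} \le \frac{2\, r!\,(m-1-r)!\, kL \exp(\beta(F(X) + F(Y)))\big(\exp(\beta F(S)) + \exp(\beta F(T))\big)}{m!\, Z_{\mathcal{C}} \exp(\beta(F(S) + F(T)))}$$

$$\le \frac{2\, r!\,(m-1-r)!\, kL}{m!\, Z_{\mathcal{C}}} \exp(2\beta\zeta_F)\big(\exp(\beta F(\sigma_S(X, Y))) + \exp(\beta F(\sigma_T(X, Y)))\big), \tag{A.1}$$

where we define $\sigma_S(X, Y) = X \oplus Y \oplus S$. To bound the total flow, we must count the pairs $(X, Y)$ such that $e$ is on their shortest path(s), and bound the flow they send. We do this in two steps, first summing over all $(X, Y)$ that share the upper bound (A.1) since they have the same difference sets $U_S = \sigma_S(X, Y)$ and $U_T = \sigma_T(X, Y)$, and then we sum over all possible $U_S$ and $U_T$. For fixed $U_S$, $U_T$, there are $\binom{m-1}{r}$ pairs that share those difference sets, since the only freedom we have is to assign $r$ of the $m - 1$ elements in $S \setminus (X \cap Y)$ other than $i$ to $Y$, and the rest to $X$. Hence, for fixed $U_S$, $U_T$:

$$\sum_{(X, Y):\, \sigma_S(X, Y) = U_S,\ \sigma_T(X, Y) = U_T} \frac{w_e(X, Y)}{Q(e)} \le 2 \sum_{r=0}^{m-1} \binom{m-1}{r} \frac{r!\,(m-1-r)!\, kL}{m!\, Z_{\mathcal{C}}} \exp(2\beta\zeta_F)\big(\exp(\beta F(U_S)) + \exp(\beta F(U_T))\big)$$

$$= \frac{2kL}{Z_{\mathcal{C}}} \exp(2\beta\zeta_F)\big(\exp(\beta F(U_S)) + \exp(\beta F(U_T))\big). \tag{A.2}$$

Finally, we sum over all valid $U_T$ ($U_S$ is determined by $U_T$), where by “valid” we mean there exist $X$ and $Y$ with $e$ on one path from $X$ to $Y$ such that $\sigma_T(X, Y) = U_T$. Any such $U_T$ can be constructed by keeping some elements of $T$ and replacing each remaining element by another member of its part: i.e., if $u \in P_\ell$, then it is replaced by some other $v \in P_\ell$, since both $X$ and $Y$ must be in $\mathcal{C}$. Hence, any $U_T$ satisfies the partition constraint, i.e., $U_T \in \mathcal{C}$ and therefore $\sum_{U_T} \exp(\beta F(U_T)) \le Z_{\mathcal{C}}$, and likewise for $U_S$. Hence, summing the bound (A.2) over all possible $U_T$ yields

$$\bar{\rho}(f) \le 4kL \exp(2\beta\zeta_F) \max_p \mathrm{len}(p) \le 4k^2 L \exp(2\beta\zeta_F),$$

where we upper bound the length of any shortest path by $k$, since $m \le k$. Hence

$$\tau_{X_0}(\varepsilon) \le 4k^2 L \exp(2\beta\zeta_F)\left(\log \pi_{\mathcal{C}}(X_0)^{-1} + \log \varepsilon^{-1}\right). \qquad \blacksquare$$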
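The cancellation in the sum over $r$ used in this proof, and its squared analogue from the uniform matroid case in A.1, can be verified exactly with rational arithmetic. The function names below are illustrative.

```python
from fractions import Fraction
from math import comb, factorial

def partition_sum(m):
    """sum_r C(m-1, r) * r! * (m-1-r)! / m!  (partition matroid case)."""
    return sum(
        Fraction(comb(m - 1, r) * factorial(r) * factorial(m - 1 - r), factorial(m))
        for r in range(m)
    )

def uniform_sum(m):
    """sum_r [C(m-1, r) * r! * (m-1-r)!]^2 / (m!)^2  (uniform matroid case)."""
    return sum(
        Fraction((comb(m - 1, r) * factorial(r) * factorial(m - 1 - r)) ** 2,
                 factorial(m) ** 2)
        for r in range(m)
    )

# Each summand equals (m-1)!/m! = 1/m (or its square), and there are m summands.
checks = [(m, partition_sum(m), uniform_sum(m)) for m in range(1, 9)]
```

Since $\binom{m-1}{r}\, r!\,(m-1-r)! = (m-1)!$ for every $r$, the partition sum equals $1$ and the uniform sum equals $1/m$; the latter factor is what cancels against the path length $m$ in A.1.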

### A.3 Proof for General Matroid Base

In the case where no structural assumption is made on $\mathcal{C}$, the proof needs to be handled more carefully, because in this case we know neither the number of legal paths between any two states, nor the number of sets $\sigma_S(X, Y)$ that fall outside $\mathcal{C}$.

We again consider arbitrary sets $X, Y \in \mathcal{C}$ where $|X \oplus Y| = 2m$. The total number of shortest paths is at least $m!$ due to the exchange property of matroids. Since the amount of flow from $X$ to $Y$ is $\pi_{\mathcal{C}}(X)\pi_{\mathcal{C}}(Y)$, each path receives at most