    # Modified log-Sobolev inequalities for strongly log-concave distributions

We show that the modified log-Sobolev constant for a natural Markov chain which converges to an r-homogeneous strongly log-concave distribution is at least 1/r. As a consequence, we obtain an asymptotically optimal mixing time bound for this chain. Applications include the bases-exchange random walk in a matroid.


## 1. Introduction

Let $\pi$ be a discrete distribution over subsets of $[n]=\{1,2,\dots,n\}$. Consider the generating polynomial of $\pi$:

 $g_{\pi}(x)=\sum_{S\subseteq[n]}\pi(S)\prod_{i\in S}x_i.$

We call a polynomial log-concave if its logarithm is concave, and strongly log-concave if it is log-concave at the all-ones vector $\mathbb{1}$ after taking any sequence of partial derivatives. The distribution $\pi$ is strongly log-concave if $g_{\pi}$ is.

An important example of a strongly log-concave distribution is the uniform distribution over the bases of a matroid (Anari et al., 2018a; Brändén and Huh, 2019). This discovery led to the breakthrough result that the exchange walk over the bases of a matroid is rapidly mixing (Anari et al., 2018a), which implies the existence of a fully polynomial-time randomised approximation scheme (FPRAS) for the number of bases of any matroid (given by an independence oracle).

The bases-exchange walk, denoted by $P_{\mathrm{BX}}$, is defined as follows. In each step, we remove an element from the current basis uniformly at random to get a set $S$. Then, we move to a basis $B\supseteq S$ uniformly at random. This chain is irreducible and it converges to the uniform distribution over the bases of a matroid. Brändén and Huh (2019) showed that the support of an $r$-homogeneous strongly log-concave distribution $\pi$ must be the set of bases of a matroid. Thus, to sample from $\pi$, we may use a random walk similar to the above. The only change required is that in the second step we move to a basis $B\supseteq S$ with probability proportional to $\pi(B)$.
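For concreteness, the walk can be sketched in a few lines. The example matroid (a partition matroid with two blocks, for which $\pi$ is uniform over four bases), the step budget, and names such as `bases_exchange_step` are our own choices for this sketch, not from the paper.

```python
import random
from itertools import combinations

# Toy matroid (illustrative choice): the partition matroid on E = {0,1,2,3}
# with blocks {0,1} and {2,3}; a basis picks exactly one element from each
# block, so the rank is r = 2 and pi is uniform over 4 bases.
E = [0, 1, 2, 3]
BLOCKS = [{0, 1}, {2, 3}]

def is_basis(S):
    return all(len(S & b) == 1 for b in BLOCKS)

BASES = [frozenset(S) for S in combinations(E, 2) if is_basis(set(S))]

def bases_exchange_step(B, rng):
    """One step of the bases-exchange walk for uniform pi: drop a uniformly
    random element, then move to a uniformly random basis containing the
    remainder."""
    e = rng.choice(sorted(B))
    S = B - {e}
    candidates = [Bp for Bp in BASES if S <= Bp]
    return rng.choice(candidates)

rng = random.Random(42)
counts = {B: 0 for B in BASES}
B = BASES[0]
for _ in range(20000):
    B = bases_exchange_step(B, rng)
    counts[B] += 1

# The chain converges to the uniform distribution over the 4 bases,
# so all four empirical frequencies should be close to 0.25.
freqs = {B: c / 20000 for B, c in counts.items()}
print(sorted(freqs.values()))
```

Running the sketch shows each basis visited roughly a quarter of the time, as the uniform stationary distribution predicts.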

Let $P$ be a Markov chain over a state space $\Omega$, and $\pi$ be its stationary distribution. To measure the convergence rate of $P$, we use the total variation mixing time,

 $t_{\mathrm{mix}}(P,\varepsilon):=\min_{t}\{t\mid\|P^{t}(x_0,\cdot)-\pi\|_{\mathrm{TV}}\le\varepsilon\},$

where $x_0\in\Omega$ is the initial state and the subscript TV denotes the total variation distance between two distributions. The main goal of this paper is to show that for any $r$-homogeneous strongly log-concave distribution $\pi$,

 (1) $t_{\mathrm{mix}}(P_{\mathrm{BX},\pi},\varepsilon)\le r\left(\log\log\frac{1}{\pi_{\min}}+\log\frac{1}{2\varepsilon^{2}}\right),$

where $\pi_{\min}:=\min_{B\in\mathcal{B}}\pi(B)$. This improves the previous bound due to Anari et al. (2018a). Since $\pi_{\min}$ is most commonly exponentially small in the input size (e.g. when $\pi$ is the uniform distribution), the improvement is usually a polynomial factor. Our bound is asymptotically optimal without further assumptions, as the upper bound is achieved when $\pi$ is the uniform distribution over the bases of some matroids (Jerrum, 2003). One such example is the matroid defined by a graph which is similar to a path, but with two parallel edges connecting every two successive vertices instead of a single edge. Equivalently, this can be viewed as the partition matroid where each block has two elements and each basis is formed by choosing exactly one element from every block. The Markov chain in this case is just a lazy random walk on the Boolean hypercube.

Our main improvement is a modified log-Sobolev inequality for $P_{\mathrm{BX},\pi}$ and $\pi$. To introduce this inequality, we define the Dirichlet form of a reversible Markov chain $P$, over state space $\Omega$, as

 $\mathcal{E}_P(f,g):=\sum_{x,y\in\Omega}\pi(x)f(x)[I-P](x,y)g(y),$

where $f,g:\Omega\to\mathbb{R}$ are two functions over $\Omega$, and $I$ denotes the identity matrix. Moreover, let the entropy of a non-negative function $f$ be

 $\operatorname{Ent}_{\pi}(f):=\mathbb{E}_{\pi}(f\log f)-\mathbb{E}_{\pi}f\log\mathbb{E}_{\pi}f,$

where we follow the convention that $0\log 0=0$. If we normalise $f$ so that $\mathbb{E}_{\pi}f=1$, then $\operatorname{Ent}_{\pi}(f)$ is the relative entropy (or Kullback–Leibler divergence) between $f\pi$ and $\pi$.

The modified log-Sobolev constant (Bobkov and Tetali, 2006) is defined as

 $\rho(P):=\inf\left\{\frac{\mathcal{E}_P(f,\log f)}{\operatorname{Ent}_{\pi}(f)}\;\middle|\;f:\Omega\to\mathbb{R}_{\ge 0},\ \operatorname{Ent}_{\pi}(f)\ne 0\right\}.$
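For intuition, the ratio inside this infimum can be evaluated numerically. The sketch below is our own construction, not from the paper: it builds the bases-exchange walk for the uniform distribution over the bases of a two-block partition matroid, a case in which the paper's main result guarantees $\rho \ge 1/2$, and checks the ratio on random test functions.

```python
import math, random

# States: bases of the toy partition matroid ({0,1} x {2,3}); pi is uniform.
BASES = [(0, 2), (0, 3), (1, 2), (1, 3)]
n = len(BASES)
pi = [1.0 / n] * n

def step_prob(B, Bp):
    # P_BX(B, B'): average over the dropped element of the chance of
    # landing on B' among the bases containing the remainder.
    p = 0.0
    for e in B:
        S = set(B) - {e}
        cand = [C for C in BASES if S <= set(C)]
        p += (1.0 / len(B)) * (1.0 / len(cand) if Bp in cand else 0.0)
    return p

P = [[step_prob(B, Bp) for Bp in BASES] for B in BASES]

def dirichlet(P, pi, f, g):
    # E_P(f, g) = <(I - P)f, g>_pi
    return sum(pi[x] * g[x] * (f[x] - sum(P[x][y] * f[y] for y in range(n)))
               for x in range(n))

def entropy(pi, f):
    Ef = sum(pi[x] * f[x] for x in range(n))
    return sum(pi[x] * f[x] * math.log(f[x]) for x in range(n)) - Ef * math.log(Ef)

rng = random.Random(0)
ratios = []
for _ in range(500):
    f = [rng.uniform(0.1, 10.0) for _ in range(n)]
    ent = entropy(pi, f)
    if ent > 1e-12:
        ratios.append(dirichlet(P, pi, f, [math.log(v) for v in f]) / ent)

# This matroid has r = 2, so every ratio should be at least 1/2.
print(min(ratios))
```

Every sampled ratio stays at or above $1/2$, consistent with the modified log-Sobolev lower bound for this $r=2$ instance (random sampling only probes the infimum, of course; it does not certify it).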

Our main theorem is the following, which is a special case of Theorem 6.

###### Theorem 1.

Let $\pi$ be an $r$-homogeneous strongly log-concave distribution. Then

 $\rho(P_{\mathrm{BX},\pi})\ge\frac{1}{r}.$

Since $t_{\mathrm{mix}}(P,\varepsilon)\le\frac{1}{\rho(P)}\left(\log\log\frac{1}{\pi_{\min}}+\log\frac{1}{2\varepsilon^{2}}\right)$ (cf. Bobkov and Tetali, 2006), Theorem 1 directly implies the mixing time bound (1).

In fact, we show more than Theorem 1. Following Anari et al. (2018a) and Kaufman and Oppenheim (2018), we stratify independent sets of the matroid by their sizes, and define two random walks for each level, depending on whether they add or delete an element first. For instance, the bases-exchange walk is the “delete-add” or “down-up” walk for the top level. We give lower bounds for the modified log-Sobolev constants of both random walks for all levels. For the complete statement, see Section 3 and Theorem 6.

The previous work of Anari et al. (2018a), building upon Kaufman and Oppenheim (2018), focuses on the spectral gap of $P_{\mathrm{BX}}$. It is well known that lower bounds on the modified log-Sobolev constant are stronger than those on the spectral gap. Thus, we need to seek a different approach. Our key lemma, Lemma 10, shows that the relative entropy contracts by a factor of $1-\frac{1}{k}$ when we go from level $k$ to level $k-1$. Theorem 1 is a simple consequence of this lemma and Jensen's inequality. In order to prove this lemma, we use a decomposition idea to inductively bound the relative entropy, which appears to be novel.

Prior to our work, similar bounds had been obtained only for strongly Rayleigh distributions, which were introduced by Borcea et al. (2009) and form a proper subset of strongly log-concave distributions. Hermon and Salez (2019) showed a lower bound on the modified log-Sobolev constant for strongly Rayleigh distributions, improving upon the spectral gap bound of Anari et al. (2016). The work of Hermon and Salez (2019) builds upon the previous work of Jerrum et al. (2004) for balanced matroids (Feder and Mihail, 1992). All of these results follow an inductive framework inspired by Lee and Yau (1998), which is apparently difficult to carry out in the case of general matroids or strongly log-concave distributions. The approach we take is entirely different.

In Section 2 we introduce necessary notions and briefly review relevant background. In Section 3 we formally state our main results. In Section 4 we prove modified log-Sobolev constant lower bounds for the “down-up” walk. In Section 5 we finish by dealing with the “up-down” walk.

## 2. Preliminaries

In this section we define and give some basic properties of Markov chains, strongly log-concave distributions, and matroids.

### 2.1. Markov chains

Let $\Omega$ be a discrete state space and $\pi$ be a distribution over $\Omega$. Let $P$ be the transition matrix of a Markov chain whose stationary distribution is $\pi$. Then $\pi P=\pi$, namely $\pi(y)=\sum_{x\in\Omega}\pi(x)P(x,y)$ for any $y\in\Omega$. We say $P$ is reversible with respect to $\pi$ if for any $x,y\in\Omega$,

 (2) $\pi(x)P(x,y)=\pi(y)P(y,x).$

We adopt the standard notation of $\mathbb{E}_{\pi}f$ for a function $f:\Omega\to\mathbb{R}$, namely

 $\mathbb{E}_{\pi}f=\sum_{x\in\Omega}\pi(x)f(x).$

We also view the transition matrix $P$ as an operator that maps functions to functions. More precisely, let $f:\Omega\to\mathbb{R}$ be a function; then $P$ acting on $f$ is defined as

 $Pf(x):=\sum_{y\in\Omega}P(x,y)f(y).$

This is also called the Markov operator corresponding to $P$. We will not distinguish the matrix from the operator, as it will be clear from the context. Note that $Pf(x)$ is the expectation of $f$ with respect to the distribution $P(x,\cdot)$. We can regard a function $f$ as a column vector in $\mathbb{R}^{\Omega}$, in which case $Pf$ is simply matrix multiplication.

The Hilbert space $L^{2}(\pi)$ is given by endowing $\mathbb{R}^{\Omega}$ with the inner product

 $\langle f,g\rangle_{\pi}:=\sum_{x\in\Omega}\pi(x)f(x)g(x),$

where $f,g:\Omega\to\mathbb{R}$. In particular, the norm in $L^{2}(\pi)$ is given by $\|f\|_{\pi}:=\sqrt{\langle f,f\rangle_{\pi}}$.

The adjoint operator of $P$ is defined as $P^{*}(x,y):=\frac{\pi(y)P(y,x)}{\pi(x)}$. Indeed, $P^{*}$ is the (unique) operator that satisfies $\langle Pf,g\rangle_{\pi}=\langle f,P^{*}g\rangle_{\pi}$. It is easy to verify that if $P$ satisfies the detailed balance condition (2) (so $P$ is reversible), then $P$ is self-adjoint, namely $P=P^{*}$.

The Dirichlet form is defined as:

 (3) $\mathcal{E}_P(f,g):=\langle(I-P)f,g\rangle_{\pi},$

where $I$ stands for the identity matrix of the appropriate size. Let the Laplacian be $L:=I-P$. Then,

 $\mathcal{E}_P(f,g)=\sum_{x,y\in\Omega}\pi(x)g(x)L(x,y)f(y)=g^{\mathsf T}\operatorname{diag}(\pi)Lf,$

where in the last expression we regard $f$, $g$, and $\pi$ as (column) vectors over $\Omega$. In particular, if $P$ is reversible, then $L^{*}=L$ and

 (4) $\mathcal{E}_P(f,g)=\langle Lf,g\rangle_{\pi}=\langle f,L^{*}g\rangle_{\pi}=\langle f,Lg\rangle_{\pi}=\mathcal{E}_P(g,f)=f^{\mathsf T}\operatorname{diag}(\pi)Lg.$

In this paper all Markov chains are reversible, and we will most commonly use the form (4). Another common expression of the Dirichlet form for reversible $P$ is

 $\mathcal{E}_P(f,g)=\frac{1}{2}\sum_{x,y\in\Omega}\pi(x)P(x,y)(f(x)-f(y))(g(x)-g(y)),$

but we will not need this expression in this paper. It is well known that the spectral gap of $P$, or equivalently the smallest positive eigenvalue of $L$, controls the convergence rate of $P$. It also has a variational characterisation. Let the variance of $f$ be

 $\operatorname{Var}_{\pi}(f):=\mathbb{E}_{\pi}f^{2}-(\mathbb{E}_{\pi}f)^{2}.$

Then

 $\lambda(P):=\inf\left\{\frac{\mathcal{E}_P(f,f)}{\operatorname{Var}_{\pi}(f)}\;\middle|\;f:\Omega\to\mathbb{R},\ \operatorname{Var}_{\pi}(f)\ne 0\right\}.$

The usefulness of $\lambda(P)$ is due to the following bound:

 (5) $t_{\mathrm{mix}}(P,\varepsilon)\le\frac{1}{\lambda(P)}\log\frac{1}{\varepsilon\,\pi_{\min}},$

where $\pi_{\min}:=\min_{x\in\Omega}\pi(x)$. See, for example, Levin and Peres (2017, Theorem 12.4).
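The two expressions for the Dirichlet form, the inner-product form (3) and the sum over pairs above, can be checked to agree numerically for any reversible chain. The weighted-graph construction below is our own minimal example, not from the paper.

```python
import random

# A small reversible chain (our illustrative construction): the random walk
# on a weighted complete graph, with P(x,y) = A[x][y]/d(x) and
# pi(x) = d(x)/sum(d), which satisfies detailed balance by construction.
rng = random.Random(1)
n = 5
A = [[0.0] * n for _ in range(n)]
for x in range(n):
    for y in range(x + 1, n):
        A[x][y] = A[y][x] = rng.uniform(0.5, 2.0)
d = [sum(A[x]) for x in range(n)]
total = sum(d)
pi = [d[x] / total for x in range(n)]
P = [[A[x][y] / d[x] for y in range(n)] for x in range(n)]

f = [rng.uniform(-1, 1) for _ in range(n)]
g = [rng.uniform(-1, 1) for _ in range(n)]

# E_P(f, g) = <(I - P)f, g>_pi ...
form1 = sum(pi[x] * g[x] * (f[x] - sum(P[x][y] * f[y] for y in range(n)))
            for x in range(n))
# ... equals (1/2) sum_{x,y} pi(x) P(x,y) (f(x)-f(y)) (g(x)-g(y))
# whenever P is reversible.
form2 = 0.5 * sum(pi[x] * P[x][y] * (f[x] - f[y]) * (g[x] - g[y])
                  for x in range(n) for y in range(n))
print(abs(form1 - form2))
```

The printed difference is at the level of floating-point round-off, reflecting that for reversible chains the two forms are identical.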

The (standard) log-Sobolev inequality relates $\mathcal{E}_P$ with the following entropy-like quantity:

 (6) $\operatorname{Ent}_{\pi}(f):=\mathbb{E}_{\pi}(f\log f)-\mathbb{E}_{\pi}f\log\mathbb{E}_{\pi}f$

for a non-negative function $f$, where we follow the convention that $0\log 0=0$. Also, $\log$ always stands for the natural logarithm in this paper. The log-Sobolev constant is defined as

 $\alpha(P):=\inf\left\{\frac{\mathcal{E}_P(\sqrt{f},\sqrt{f})}{\operatorname{Ent}_{\pi}(f)}\;\middle|\;f:\Omega\to\mathbb{R}_{\ge 0},\ \operatorname{Ent}_{\pi}(f)\ne 0\right\}.$

The constant $\alpha(P)$ gives a better control of the mixing time of $P$, as shown by Diaconis and Saloff-Coste (1996):

 (7) $t_{\mathrm{mix}}(P,\varepsilon)\le\frac{1}{4\alpha(P)}\left(\log\log\frac{1}{\pi_{\min}}+\log\frac{1}{2\varepsilon^{2}}\right).$

The saving seems modest compared to (5), but it is quite common that $\pi_{\min}$ is exponentially small in the instance size, in which case the saving is a polynomial factor.

What we are interested in, however, is the following modified log-Sobolev constant introduced by Bobkov and Tetali (2006):

 $\rho(P):=\inf\left\{\frac{\mathcal{E}_P(f,\log f)}{\operatorname{Ent}_{\pi}(f)}\;\middle|\;f:\Omega\to\mathbb{R}_{\ge 0},\ \operatorname{Ent}_{\pi}(f)\ne 0\right\}.$

Similar to (7), we have that

 (8) $t_{\mathrm{mix}}(P,\varepsilon)\le\frac{1}{\rho(P)}\left(\log\log\frac{1}{\pi_{\min}}+\log\frac{1}{2\varepsilon^{2}}\right),$

as shown by Bobkov and Tetali (2006, Corollary 2.8).

For reversible , the following relationships among these constants are known,

 $2\lambda(P)\ge\rho(P)\ge 4\alpha(P).$

See, for example, Bobkov and Tetali (2006, Proposition 3.6).

Thus, lower bounds on these constants are increasingly difficult to obtain. However, to get the best asymptotic control of the mixing time, one only needs to lower bound the modified log-Sobolev constant $\rho(P)$ instead of $\alpha(P)$, by comparing (7) and (8). Indeed, as observed by Hermon and Salez (2019), by taking $f$ to be the indicator function of $x$ for all $x\in\Omega$,

 $\alpha(P)\le\min_{x\in\Omega}\left\{\frac{1}{-\log\pi(x)}\right\}.$

In our setting of $r$-homogeneous strongly log-concave distributions, we cannot hope for a uniform bound for $\alpha(P)$ similar to Theorem 1, as the right hand side of the above can be arbitrarily small for fixed $r$.

By (3) and (6), it is clear that if we replace $f$ by $cf$ for some constant $c>0$, then both $\mathcal{E}_P(f,\log f)$ and $\operatorname{Ent}_{\pi}(f)$ increase by the same factor $c$. Thus, in order to bound $\rho(P)$, we may further assume that $\mathbb{E}_{\pi}f=1$. This assumption allows a simplification $\operatorname{Ent}_{\pi}(f)=\mathbb{E}_{\pi}(f\log f)$. Indeed, in this case, $f\pi$ is a distribution, and $\operatorname{Ent}_{\pi}(f)$ is the relative entropy (or Kullback–Leibler divergence) between $f\pi$ and $\pi$.
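This identity is easy to confirm numerically. The following sketch is our own (with an arbitrary small state space): it normalises a random non-negative $f$ so that $\mathbb{E}_{\pi}f=1$ and compares $\operatorname{Ent}_{\pi}(f)$ with the KL divergence between $f\pi$ and $\pi$.

```python
import math, random

rng = random.Random(7)
n = 6
# A random distribution pi and a nonnegative f normalised so that E_pi f = 1.
raw = [rng.uniform(0.2, 1.0) for _ in range(n)]
pi = [v / sum(raw) for v in raw]
f = [rng.uniform(0.1, 3.0) for _ in range(n)]
Ef = sum(p * v for p, v in zip(pi, f))
f = [v / Ef for v in f]          # now E_pi f = 1

# Ent_pi(f) = E_pi(f log f) - (E_pi f) log(E_pi f) = E_pi(f log f) here.
ent = sum(p * v * math.log(v) for p, v in zip(pi, f))

# nu := f * pi is a distribution, and Ent_pi(f) equals KL(nu || pi).
nu = [p * v for p, v in zip(pi, f)]
kl = sum(q * math.log(q / p) for q, p in zip(nu, pi))
print(abs(ent - kl))
```

The two quantities agree to machine precision, since $\sum_x \nu(x)\log\frac{\nu(x)}{\pi(x)} = \sum_x \pi(x) f(x)\log f(x)$ term by term.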

### 2.2. Strongly log-concave distributions

We write $\partial_i$ as shorthand for $\frac{\partial}{\partial x_i}$, and for an index set $I\subseteq[n]$, $\partial_I$ as shorthand for $\prod_{i\in I}\partial_i$.

###### Definition 2.

A polynomial $g\in\mathbb{R}[x_1,\dots,x_n]$ with non-negative coefficients is log-concave at $x\in\mathbb{R}_{\ge 0}^{n}$ if its Hessian $\nabla^{2}\log g$ is negative semi-definite at $x$. We call $g$ strongly log-concave if for any index set $I\subseteq[n]$, $\partial_I g$ is log-concave at the all-$1$ vector $\mathbb{1}$.

The notion of strong log-concavity was introduced by Gurvits (2009a, b). There are also notions of complete log-concavity introduced by Anari et al. (2018b), and Lorentzian polynomials introduced by Brändén and Huh (2019). It turns out that all three notions are equivalent. See Brändén and Huh (2019, Theorem 5.3).

The following property of strongly log-concave polynomials is particularly useful (Anari et al., 2018b; Brändén and Huh, 2019).

###### Proposition 3.

If $g$ is strongly log-concave, then for any $I\subseteq[n]$, the Hessian matrix $\nabla^{2}\partial_I g(\mathbb{1})$ has at most one positive eigenvalue.

In fact, in this setting the Hessian of the polynomial having at most one positive eigenvalue is equivalent to log-concavity, but we will only need the direction above.

A distribution $\pi$ is called $r$-homogeneous (or strongly log-concave) if $g_{\pi}$ is.

### 2.3. Matroids

A matroid $M$ is a combinatorial structure that abstracts the notion of independence. We shall define it in terms of its independent sets, although many different equivalent definitions exist. Formally, a matroid $M=(E,\mathcal{I})$ consists of a finite ground set $E$ and a collection $\mathcal{I}$ of subsets of $E$ (independent sets) that satisfy the following:

• $\emptyset\in\mathcal{I}$;

• if $S\in\mathcal{I}$ and $T\subseteq S$, then $T\in\mathcal{I}$;

• if $S,T\in\mathcal{I}$ and $|S|>|T|$, then there exists an element $i\in S\setminus T$ such that $T\cup\{i\}\in\mathcal{I}$.

The first condition guarantees that $\mathcal{I}$ is non-empty, the second implies that $\mathcal{I}$ is downward closed, and the third is usually called the augmentation axiom. We direct the reader to Oxley (1992) for a reference book on matroid theory. In particular, the augmentation axiom implies that all the maximal independent sets have the same cardinality, namely the rank $r$ of $M$. The set of bases $\mathcal{B}$ is the collection of maximal independent sets of $M$. Furthermore, we denote by $M(k)$ the collection of independent sets of size $k$, where $0\le k\le r$. If we dropped the augmentation axiom, the resulting structure would be a non-empty collection of subsets of $E$ that is downward closed, known as an (abstract) simplicial complex.
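These three axioms are directly checkable by brute force on small ground sets. The snippet below is our own example (the graphic matroid of a triangle, equivalently the uniform matroid of rank 2 on 3 elements); it verifies the axioms and recovers the bases.

```python
from itertools import combinations

# Our toy example: the graphic matroid of a triangle. Ground set = its three
# edges; a set of edges is independent iff it contains no cycle, i.e. it is
# not all of E. That leaves all subsets of size at most 2.
E = frozenset({'a', 'b', 'c'})
INDEP = [frozenset(S) for k in range(3) for S in combinations(sorted(E), k)]

def is_matroid(ground, indep):
    indep = set(indep)
    if frozenset() not in indep:                       # non-emptiness
        return False
    for S in indep:                                    # downward closure
        # removing one element at a time suffices, by induction
        if any(S - {x} not in indep for x in S):
            return False
    for S in indep:                                    # augmentation
        for T in indep:
            if len(S) > len(T):
                if not any(T | {x} in indep for x in S - T):
                    return False
    return True

rank = max(len(S) for S in INDEP)
bases = [S for S in INDEP if len(S) == rank]
print(is_matroid(E, INDEP), rank, len(bases))
```

The check passes: the rank is 2 and there are three bases, the three 2-element edge sets, all of the same cardinality as the augmentation axiom requires.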

Brändén and Huh (2019, Theorem 7.1) showed that the support of an $r$-homogeneous strongly log-concave distribution $\pi$ is the set of bases of a matroid $M$ of rank $r$. We equip the independent sets of $M$ with a weight function $w:M\to\mathbb{R}_{\ge 0}$, recursively defined as follows. (One may define $w(I)$ to be a $\frac{1}{(r-|I|)!}$ fraction of the current definition. This alternative definition would eliminate many factorial factors in the rest of the paper; however, it is inconsistent with the literature (Anari et al., 2018a; Kaufman and Oppenheim, 2018), so we do not adopt it.)

 $w(I):=\begin{cases}\pi(I)\,Z_r&\text{if }|I|=r,\\[2pt]\displaystyle\sum_{I'\supset I,\,|I'|=|I|+1}w(I')&\text{if }|I|<r,\end{cases}$

for some normalisation constant $Z_r>0$. For example, we may choose $Z_r=|\mathcal{B}|$ and $w(B)=1$ for all $B\in\mathcal{B}$, which corresponds to the uniform distribution over $\mathcal{B}$. It follows that

 $w(I)=(r-|I|)!\sum_{B\in\mathcal{B},\,I\subseteq B}w(B).$

Let $\pi_k$ be the distribution over $M(k)$ such that $\pi_k(I)\propto w(I)$ for $I\in M(k)$. Thus $\pi_r=\pi$. Let $Z_k:=\sum_{I\in M(k)}w(I)$ be the normalisation constant of $\pi_k$. In fact, for any $0\le k\le r$, $k!\,Z_k=r!\,Z_r$.
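The recursion and its closed form can be checked mechanically on a small example. Below (our own construction) we take the uniform distribution over the bases of a two-block partition matroid, set $w(B)=1$ on bases (i.e. $Z_r=|\mathcal{B}|$), and verify the closed form as well as the identity $k!\,Z_k=r!\,Z_r$.

```python
from itertools import combinations
from math import factorial

# Toy r-homogeneous example (our choice): the partition matroid with blocks
# {0,1} and {2,3}, with w(B) = 1 for every basis.
BLOCKS = [{0, 1}, {2, 3}]
E = sorted(set().union(*BLOCKS))
r = len(BLOCKS)
bases = [frozenset(S) for S in combinations(E, r)
         if all(len(set(S) & b) == 1 for b in BLOCKS)]
indep = {frozenset(I) for B in bases for k in range(r + 1)
         for I in combinations(sorted(B), k)}

# w defined recursively, level by level, downwards from the top level.
w = {B: 1.0 for B in bases}
for k in range(r - 1, -1, -1):
    for I in [S for S in indep if len(S) == k]:
        w[I] = sum(w[J] for J in indep if len(J) == k + 1 and I < J)

# Closed form: w(I) = (r - |I|)! * sum of w(B) over bases containing I.
for I, val in w.items():
    closed = factorial(r - len(I)) * sum(w[B] for B in bases if I <= B)
    assert abs(val - closed) < 1e-9

# k! * Z_k is independent of k (so the pi_k are consistently normalised).
Z = [sum(w[I] for I in indep if len(I) == k) for k in range(r + 1)]
print([factorial(k) * Z[k] for k in range(r + 1)])  # → [8.0, 8.0, 8.0]
```

Here $w(\emptyset)=8$, $Z_0=Z_1=8$ and $Z_2=4$, so $k!\,Z_k=8$ for every level, as claimed.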

It is straightforward to verify that for any $I\in M$,

 (9) $\partial_I g_{\pi}(\mathbb{1})=\sum_{B\in\mathcal{B},\,I\subseteq B}\pi(B)=\frac{1}{Z_r}\sum_{B\in\mathcal{B},\,I\subseteq B}w(B).$

We also write $\partial_u$ as shorthand for $\partial_{\{u\}}$ for any $u\in E$.

For an independent set $I\in M$, the contraction $M_I:=\{S\subseteq E\setminus I\mid S\cup I\in M\}$ is also a matroid, where $E$ is the ground set. We equip $M_I$ with a weight function $w_I$ such that $w_I(S):=w(S\cup I)$. We may similarly define distributions $\pi_{I,k}$ for $M_I$ such that $\pi_{I,k}(S)\propto w_I(S)$ for $S\in M_I(k)$. For convenience, instead of defining $\pi_{I,k}$ over $M_I(k)$, we define it over $M(|I|+k)$ such that for any $J\in M(|I|+k)$,

 (10) $\pi_{I,k}(J):=\begin{cases}\dfrac{k!\,w(J)}{w(I)}&\text{if }I\subset J;\\[4pt]0&\text{otherwise}.\end{cases}$

Notice that the normalising constant is indeed correct, since the recursive definition of $w$ implies $w(I)=k!\sum_{J\supset I,\,|J|=|I|+k}w(J)$.

If $|I|\le r-2$, let $W_I$ be the matrix indexed by $E\setminus I$ such that $W_I(u,v):=w_I(\{u,v\})$ for any $u\ne v$, and $W_I(u,u):=0$. Then notice that

 $w_I(\{u,v\})=w(I\cup\{u,v\})=(r-|I|-2)!\sum_{B\in\mathcal{B},\,I\cup\{u,v\}\subseteq B}w(B)=(r-|I|-2)!\,Z_r\cdot\partial_u\partial_v\partial_I g_{\pi}(\mathbb{1}),$

where the last equality is by (9). In other words, $W_I$ is the Hessian $\nabla^{2}\partial_I g_{\pi}(\mathbb{1})$ multiplied by the scalar $(r-|I|-2)!\,Z_r$. Thus, Proposition 3 implies the following.

###### Proposition 4.

Let $\pi$ be an $r$-homogeneous strongly log-concave distribution over $2^{E}$. If $I\in M$ and $|I|\le r-2$, then the matrix $W_I$ has at most one positive eigenvalue.

Proposition 4 implies the following bound for a quadratic form, which will be useful later.

###### Lemma 5.

Let $f:E\setminus I\to\mathbb{R}$ be a function such that $\sum_{u\in E\setminus I}w(I\cup\{u\})f(u)=w(I)$. Then

 $f^{\mathsf T}W_I f\le w(I).$
###### Proof.

Let $W:=W_I$. The recursive definition of $w$ implies that $W\mathbb{1}=d$, where $d(u)=\sum_{v\ne u}w(I\cup\{u,v\})=w(I\cup\{u\})$. Let $D:=\operatorname{diag}(d)$ and $A:=D^{-1/2}WD^{-1/2}$. Then $A$ is a real symmetric matrix. By Proposition 4, $W$ has at most one positive eigenvalue, and thus so does $A$. We may decompose $A$ as

 (11) $A=\sum_{i=1}^{|E\setminus I|}\lambda_i\,g_i g_i^{\mathsf T},$

where $(g_i)$ is an orthonormal basis and $\lambda_i\le 0$ for all $i\ge 2$. Moreover, notice that $D^{1/2}\mathbb{1}$ is an eigenvector of $A$ with eigenvalue $1$, since $AD^{1/2}\mathbb{1}=D^{-1/2}W\mathbb{1}=D^{-1/2}d=D^{1/2}\mathbb{1}$. Thus, $\lambda_1=1$, and $g_1$ can be taken as $\frac{D^{1/2}\mathbb{1}}{\sqrt{w(I)}}$, as $\|D^{1/2}\mathbb{1}\|^{2}=\sum_{u\in E\setminus I}w(I\cup\{u\})=w(I)$.

The decomposition (11) directly implies that

 $W=\sum_{i=1}^{|E\setminus I|}\lambda_i\,h_i h_i^{\mathsf T},$

where $h_i:=D^{1/2}g_i$. In particular, $h_1=\frac{D\mathbb{1}}{\sqrt{w(I)}}=\frac{d}{\sqrt{w(I)}}$. The assumption on $f$ can be rewritten as $\langle h_1,f\rangle=\sqrt{w(I)}$. Thus,

 $f^{\mathsf T}W_I f=\sum_{i=1}^{|E\setminus I|}\lambda_i\langle h_i,f\rangle^{2}\le\langle h_1,f\rangle^{2}=w(I),$

where the inequality is due to the fact that $\lambda_1=1$ and $\lambda_i\le 0$ for all $i\ge 2$. The lemma follows. ∎

## 3. Main results

There are two natural random walks $P_k^{\wedge}$ and $P_k^{\vee}$ on $M(k)$, obtained by starting with adding or deleting an element and then coming back to $M(k)$. Given the current $I\in M(k)$, the "up-down" random walk $P_k^{\wedge}$ first chooses $J\in M(k+1)$ such that $I\subset J$ with probability proportional to $w(J)$, and then removes one element from $J$ uniformly at random. More formally, for $I,J\in M(k)$ and $k\le r-1$, we have that

 (12) $P_k^{\wedge}(I,J)=\begin{cases}\dfrac{1}{k+1}&\text{if }I=J;\\[4pt]\dfrac{w(I\cup J)}{(k+1)\,w(I)}&\text{if }I\cup J\in M(k+1);\\[4pt]0&\text{otherwise}.\end{cases}$

The "down-up" random walk $P_k^{\vee}$ removes an element of $I$ uniformly at random to get $I'$, and then moves to $J\in M(k)$ such that $I'\subset J$ with probability proportional to $w(J)$. More formally, for $I,J\in M(k)$,

 (13) $P_k^{\vee}(I,J)=\begin{cases}\displaystyle\sum_{I'\in M(k-1),\,I'\subset I}\dfrac{w(I)}{k\,w(I')}&\text{if }I=J;\\[4pt]\dfrac{w(J)}{k\,w(I\cap J)}&\text{if }|I\cap J|=k-1;\\[4pt]0&\text{if }|I\cap J|<k-1.\end{cases}$

Thus, the bases-exchange walk according to $\pi$ is just $P_r^{\vee}$. The stationary distribution of both $P_k^{\wedge}$ and $P_k^{\vee}$ is $\pi_k$.
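A quick numerical sanity check (our own sketch, reusing the two-block partition matroid with $w=1$ on bases) builds the down-up walks $P_k^{\vee}$ from (13) and confirms that each is stochastic with stationary distribution $\pi_k$.

```python
from itertools import combinations

# Toy setup (our choice): partition matroid with blocks {0,1}, {2,3};
# w(B) = 1 on bases, extended downwards by the recursion for w.
BLOCKS = [{0, 1}, {2, 3}]
E = sorted(set().union(*BLOCKS))
r = 2
bases = [frozenset(S) for S in combinations(E, r)
         if all(len(set(S) & b) == 1 for b in BLOCKS)]
indep = {frozenset(I) for B in bases for k in range(r + 1)
         for I in combinations(sorted(B), k)}
w = {B: 1.0 for B in bases}
for k in range(r - 1, -1, -1):
    for I in [S for S in indep if len(S) == k]:
        w[I] = sum(w[J] for J in indep if len(J) == k + 1 and I < J)

def level(k):
    return sorted((I for I in indep if len(I) == k), key=sorted)

def down_up(k):
    """P_k^v per (13): drop an element u.a.r., then move up w-proportionally."""
    Mk = level(k)
    P = {}
    for I in Mk:
        for e in I:
            S = I - {e}
            for J in (J for J in Mk if S < J):
                P[I, J] = P.get((I, J), 0.0) + (1.0 / k) * w[J] / w[S]
    return Mk, P

for k in range(1, r + 1):
    Mk, P = down_up(k)
    pik = {I: w[I] / sum(w[J] for J in Mk) for I in Mk}
    for J in Mk:   # stationarity: pi_k P = pi_k
        lhs = sum(pik[I] * P.get((I, J), 0.0) for I in Mk)
        assert abs(lhs - pik[J]) < 1e-9
    for I in Mk:   # rows sum to one
        assert abs(sum(P.get((I, J), 0.0) for J in Mk) - 1.0) < 1e-9
print("down-up walks are stochastic with stationary pi_k")
```

On this instance the top-level walk $P_2^{\vee}$ is exactly the bases-exchange walk from the introduction, with holding probability $1/2$ at every basis.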

###### Theorem 6.

Let $\pi$ be an $r$-homogeneous strongly log-concave distribution, and $M$ the associated matroid. Let $P_k^{\wedge}$ and $P_k^{\vee}$ be defined as above on $M(k)$. Then the following hold:

• for any $1\le k\le r$, $\rho(P_k^{\vee})\ge\frac{1}{k}$;

• for any $1\le k\le r-1$, $\rho(P_k^{\wedge})\ge\frac{1}{k+1}$.

The first part of Theorem 6 is shown by Corollary 11, and the second part by Lemma 13. Interestingly, we do not know how to directly relate $\rho(P_k^{\wedge})$ with $\rho(P_{k+1}^{\vee})$, although it is straightforward to see that both walks have the same spectral gap (see (16) and (17) below).

By (8), we have the following corollary.

###### Corollary 7.

In the same setting as Theorem 6, we have that

• for any $1\le k\le r$, $t_{\mathrm{mix}}(P_k^{\vee},\varepsilon)\le k\left(\log\log\frac{1}{\min_{I\in M(k)}\pi_k(I)}+\log\frac{1}{2\varepsilon^{2}}\right)$;

• for any $1\le k\le r-1$, $t_{\mathrm{mix}}(P_k^{\wedge},\varepsilon)\le(k+1)\left(\log\log\frac{1}{\min_{I\in M(k)}\pi_k(I)}+\log\frac{1}{2\varepsilon^{2}}\right)$.

In particular, for the bases-exchange walk according to $\pi$,

 $t_{\mathrm{mix}}(P_{\mathrm{BX},\pi},\varepsilon)\le r\left(\log\log\pi_{\min}^{-1}+\log\frac{1}{2\varepsilon^{2}}\right).$

For example, for the uniform distribution over the bases of a matroid, Corollary 7 implies that the mixing time of the bases-exchange walk is at most $r\left(\log\log|\mathcal{B}|+\log\frac{1}{2\varepsilon^{2}}\right)$, which improves upon the bound of Anari et al. (2018a). The mixing time bound in Corollary 7 is asymptotically optimal, as it is achieved for the bases of some matroids (Jerrum, 2003, Ex. 9.14). As mentioned in the introduction, one such example is the matroid defined by a graph which is similar to a path, but with two parallel edges connecting every two successive vertices instead of a single edge. Equivalently, this can be viewed as the partition matroid where each block has two elements and each basis is formed by choosing exactly one element from every block. The rank of this matroid is $n$, the number of blocks, and $\pi_{\min}=2^{-n}$. The Markov chain in this case is just a lazy random walk on the $n$-dimensional Boolean hypercube, which has mixing time $\Theta(n\log n)$, matching the upper bound in Corollary 7.

## 4. The down-up walk

In this section and what follows, we always assume that the matroid $M$ and the weight function $w$ correspond to an $r$-homogeneous strongly log-concave distribution $\pi$.

We first give some basic decompositions of $P_k^{\wedge}$ and $P_k^{\vee}$. Let $A_k$ be a matrix whose rows are indexed by $M(k)$ and columns by $M(k+1)$ such that

 $A_k(I,J):=\begin{cases}1&\text{if }I\subset J;\\0&\text{otherwise},\end{cases}$

and let $w_k$ be the vector $(w(I))_{I\in M(k)}$. Moreover, let

 (14) $P_k^{\uparrow}:=\operatorname{diag}(w_k)^{-1}A_k\operatorname{diag}(w_{k+1}),$

 (15) $P_{k+1}^{\downarrow}:=\frac{1}{k+1}A_k^{\mathsf T}.$

Then

 (16) $P_k^{\wedge}=P_k^{\uparrow}P_{k+1}^{\downarrow},$

 (17) $P_{k+1}^{\vee}=P_{k+1}^{\downarrow}P_k^{\uparrow}.$

Let $D_k:=\operatorname{diag}(\pi_k)$. Using (14) and (15), we get that

 (18) $D_{k+1}P_{k+1}^{\downarrow}=(P_k^{\uparrow})^{\mathsf T}D_k.$

For $k\in[r]$ and a function $f^{(k)}:M(k)\to\mathbb{R}$, define $f^{(i)}$ for $0\le i<k$ such that

 (19) $f^{(i)}:=\left(\prod_{j=i}^{k-1}P_j^{\uparrow}\right)f^{(k)}.$

Intuitively, $f^{(i)}$ is the function $f^{(k)}$ "pushed down" to level $i$. The key lemma, namely Lemma 10, is that this operation contracts the relative entropy by a factor of $1-\frac{1}{k}$ from level $k$ to level $k-1$.

We first establish some properties of $f^{(i)}$.

###### Lemma 8.

Let $k\in[r]$ and $f^{(k)}$ be a non-negative function on $M(k)$ such that $\mathbb{E}_{\pi_k}f^{(k)}=1$. Then we have the following:

1. for any $0\le i<k$ and $J\in M(i)$, $f^{(i)}(J)=\mathbb{E}_{\pi_{J,k-i}}f^{(k)}$;

2. for any $0\le i<k$, $\mathbb{E}_{\pi_i}f^{(i)}=1$.

###### Proof.

For (1), we do an induction on $i$, going from $i=k-1$ down to $i=0$. The base case of $i=k-1$ is straightforward to verify. For the induction step, suppose the claim holds for all integers larger than $i$ (where $i<k-1$). Then we have that

$$\begin{aligned} f^{(i)}(J)&=P_i^{\uparrow}f^{(i+1)}(J)=\sum_{I\in M(i+1):\,I\supset J}\frac{w(I)}{w(J)}\cdot f^{(i+1)}(I)\\ &=\sum_{I\in M(i+1):\,I\supset J}\frac{w(I)}{w(J)}\cdot\mathbb{E}_{\pi_{I,k-i-1}}f^{(k)}&&\text{(by the induction hypothesis)}\\ &=\sum_{I\in M(i+1):\,I\supset J}\frac{w(I)}{w(J)}\sum_{K\in M(k):\,K\supset I}\frac{(k-i-1)!\,w(K)}{w(I)}\cdot f^{(k)}(K)\\ &=\sum_{K\in M(k):\,K\supset J}\bigl|\{I\in M(i+1):J\subset I\subset K\}\bigr|\cdot\frac{(k-i-1)!\,w(K)}{w(J)}\cdot f^{(k)}(K)\\ &=\sum_{K\in M(k):\,K\supset J}\frac{(k-i)!\,w(K)}{w(J)}\cdot f^{(k)}(K)\\ &=\mathbb{E}_{\pi_{J,k-i}}f^{(k)}. \end{aligned}$$

For (2), we have that

$$\begin{aligned} \mathbb{E}_{\pi_i}f^{(i)}&=\sum_{J\in M(i)}\pi_i(J)\,\mathbb{E}_{\pi_{J,k-i}}f^{(k)}&&\text{(by (1))}\\ &=\sum_{J\in M(i)}\frac{w(J)}{Z_i}\sum_{K\in M(k):\,K\supset J}\frac{(k-i)!\,w(K)}{w(J)}\cdot f^{(k)}(K)\\ &=\sum_{K\in M(k)}\sum_{J\in M(i):\,J\subset K}\frac{(k-i)!\,w(K)}{Z_i}\cdot f^{(k)}(K)\\ &=\sum_{K\in M(k)}\frac{k!\,w(K)}{i!\,Z_i}\cdot f^{(k)}(K)=\sum_{K\in M(k)}\frac{w(K)}{Z_k}\cdot f^{(k)}(K)&&\text{(as }k!\,Z_k=i!\,Z_i)\\ &=\mathbb{E}_{\pi_k}f^{(k)}=1.\qquad\qed \end{aligned}$$
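Both claims of Lemma 8 can be checked numerically; the sketch below is our own verification on the toy two-block partition matroid (with $w=1$ on bases and $k=r=2$), with helper names of our choosing.

```python
from itertools import combinations
from math import factorial
import random

# Toy setup (our choice): partition matroid with blocks {0,1}, {2,3}.
BLOCKS = [{0, 1}, {2, 3}]
E = sorted(set().union(*BLOCKS))
r = 2
bases = [frozenset(S) for S in combinations(E, r)
         if all(len(set(S) & b) == 1 for b in BLOCKS)]
indep = {frozenset(I) for B in bases for k in range(r + 1)
         for I in combinations(sorted(B), k)}
w = {B: 1.0 for B in bases}
for k in range(r - 1, -1, -1):
    for I in [S for S in indep if len(S) == k]:
        w[I] = sum(w[J] for J in indep if len(J) == k + 1 and I < J)

def level(k):
    return [I for I in indep if len(I) == k]

def push_down(f, k):
    """One application of P_up (14): f^(k-1)(J) = sum_{I > J} w(I)/w(J) f(I)."""
    return {J: sum(w[I] / w[J] * f[I] for I in level(k) if J < I)
            for J in level(k - 1)}

rng = random.Random(3)
k = r
fk = {I: rng.uniform(0.1, 2.0) for I in level(k)}
Zk = sum(w[I] for I in level(k))
Efk = sum(w[I] / Zk * fk[I] for I in level(k))
fk = {I: v / Efk for I, v in fk.items()}          # E_{pi_k} f^(k) = 1

fs = {k: fk}
for i in range(k - 1, -1, -1):
    fs[i] = push_down(fs[i + 1], i + 1)

for i in range(k):
    # claim (1): f^(i)(J) = E_{pi_{J,k-i}} f^(k)
    for J in level(i):
        expect = sum(factorial(k - i) * w[K] / w[J] * fk[K]
                     for K in level(k) if J < K)
        assert abs(fs[i][J] - expect) < 1e-9
    # claim (2): E_{pi_i} f^(i) = 1
    Zi = sum(w[I] for I in level(i))
    assert abs(sum(w[I] / Zi * fs[i][I] for I in level(i)) - 1.0) < 1e-9
print("Lemma 8 verified on the toy matroid")
```

In particular, at level $0$ the pushed-down function collapses to the single value $f^{(0)}(\emptyset)=\mathbb{E}_{\pi_k}f^{(k)}=1$.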

Now we are ready to establish the base case of the entropy contraction.

###### Lemma 9.

Let $f^{(2)}$ be a non-negative function defined on $M(2)$. Then

 $\operatorname{Ent}_{\pi_2}(f^{(2)})\ge 2\operatorname{Ent}_{\pi_1}(f^{(1)}).$
###### Proof.

Without loss of generality we may assume that $\mathbb{E}_{\pi_2}f^{(2)}=1$, and therefore $\mathbb{E}_{\pi_1}f^{(1)}=1$ by (2) of Lemma 8. Note that for $v\in M(1)$ (we identify the singleton $\{v\}$ with $v$),

 $f^{(1)}(v)=\sum_{S\in M(2):\,v\in S}\frac{w(S)}{w(v)}f^{(2)}(S).$

We will use the following inequality, which is valid for any $a\ge 0$ and $b>0$,

 (20) $a\log\frac{a}{b}\ge a-b.$

Noticing that $\operatorname{Ent}_{\pi}(f)=\mathbb{E}_{\pi}(f\log f)$ under our normalisation, and that $Z_1=2Z_2$, we have

$$\begin{aligned} &\operatorname{Ent}_{\pi_2}(f^{(2)})-2\operatorname{Ent}_{\pi_1}(f^{(1)})\\ &\quad=\sum_{S\in M(2)}\pi_2(S)f^{(2)}(S)\log f^{(2)}(S)-2\sum_{v\in E}\pi_1(v)\Bigl(\sum_{S\in M(2):\,v\in S}\frac{w(S)}{w(v)}f^{(2)}(S)\Bigr)\log f^{(1)}(v)\\ &\quad=\sum_{S\in M(2)}\Bigl(\pi_2(S)f^{(2)}(S)\log f^{(2)}(S)-2\sum_{v\in S}\pi_1(v)\frac{w(S)}{w(v)}f^{(2)}(S)\log f^{(1)}(v)\Bigr)\\ &\quad=\sum_{S\in M(2)}\Bigl(\frac{w(S)}{Z_2}f^{(2)}(S)\log f^{(2)}(S)-2\sum_{v\in S}\frac{w(v)}{Z_1}\cdot\frac{w(S)}{w(v)}f^{(2)}(S)\log f^{(1)}(v)\Bigr)\\ &\quad=\sum_{S=\{u,v\}\in M(2)}\frac{w(S)}{Z_2}f^{(2)}(S)\bigl(\log f^{(2)}(S)-\log f^{(1)}(v)-\log f^{(1)}(u)\bigr)\\ &\quad\ge\sum_{S=\{u,v\}\in M(2)}\frac{w(S)}{Z_2}\bigl(f^{(2)}(S)-f^{(1)}(v)f^{(1)}(u)\bigr)\\ &\quad=\sum_{S\in M(2)}\pi_2(S)f^{(2)}(S)-\sum_{S=\{u,v\}\in M(2)}\frac{w(S)}{Z_2}\,f^{(1)}(v)f^{(1)}(u)\\ &\quad=1-\frac{1}{2Z_2}\,(f^{(1)})^{\mathsf T}W_{\emptyset}f^{(1)}, \end{aligned}$$

where the inequality is by (20) with $a=f^{(2)}(S)$ and $b=f^{(1)}(u)f^{(1)}(v)$ when $f^{(2)}(S)>0$; when $f^{(2)}(S)=0$, the corresponding term on the left is $0$ while the term on the right is $-f^{(1)}(u)f^{(1)}(v)\le 0$, so the inequality holds as well. Thus, the lemma follows from Lemma 5 with $I=\emptyset$ and $f=f^{(1)}$, noting that $\sum_{v\in E}w(v)f^{(1)}(v)=Z_1\,\mathbb{E}_{\pi_1}f^{(1)}=Z_1=w(\emptyset)=2Z_2$. ∎
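As a sanity check of the base case, the inequality of Lemma 9 can be verified numerically; the Monte-Carlo sketch below is our own, again on the toy two-block partition matroid with $w=1$ on bases.

```python
import math, random
from itertools import combinations

# Toy setup (our choice): partition matroid with blocks {0,1}, {2,3}.
BLOCKS = [{0, 1}, {2, 3}]
E = sorted(set().union(*BLOCKS))
bases = [frozenset(S) for S in combinations(E, 2)
         if all(len(set(S) & b) == 1 for b in BLOCKS)]
singles = [frozenset({u}) for u in E]
w = {B: 1.0 for B in bases}
for I in singles:
    w[I] = sum(w[B] for B in bases if I < B)

Z2 = sum(w[B] for B in bases)
Z1 = sum(w[I] for I in singles)
pi2 = {B: w[B] / Z2 for B in bases}
pi1 = {I: w[I] / Z1 for I in singles}

def entropy(pi, f):
    Ef = sum(pi[x] * f[x] for x in pi)
    return (sum(pi[x] * f[x] * math.log(f[x]) for x in pi)
            - Ef * math.log(Ef))

rng = random.Random(5)
for _ in range(300):
    f2 = {B: rng.uniform(0.05, 5.0) for B in bases}
    # f^(1) = P_up f^(2), the function pushed down one level.
    f1 = {I: sum(w[B] / w[I] * f2[B] for B in bases if I < B)
          for I in singles}
    # Lemma 9: Ent_{pi_2}(f^(2)) >= 2 Ent_{pi_1}(f^(1)).
    assert entropy(pi2, f2) >= 2.0 * entropy(pi1, f1) - 1e-10
print("Ent_pi2(f^(2)) >= 2 Ent_pi1(f^(1)) held in all trials")
```

The inequality held in every trial; equality is approached by functions that factorise across the two blocks, consistent with the tightness discussion in the introduction.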

We generalise Lemma 9 as follows.

###### Lemma 10.

Let $2\le k\le r$ and $f^{(k)}$ be a non-negative function defined on $M(k)$. Then

 $\operatorname{Ent}_{\pi_k}(f^{(k)})\ge\frac{k}{k-1}\operatorname{Ent}_{\pi_{k-1}}(f^{(k-1)}).$
###### Proof.

We do an induction on $k$. The base case of $k=2$ follows from Lemma 9.

For the induction step, assume the lemma holds for all integers at most $k$, for any matroid. Let $f^{(k+1)}$ be a non-negative function on $M(k+1)$ such that $\mathbb{E}_{\pi_{k+1}}f^{(k+1)}=1$.

Recall (10), where we define $\pi_{I,k}$ over $M(|I|+k)$ instead of over $M_I(k)$. For $v\in M(1)$ and $I\in M(k+1)$ such that $v\in I$,

 $\pi_{k+1}(I)=\frac{w(I)}{Z_{k+1}}=(k+1)\cdot\frac{w(v)}{(k+1)!\,Z_{k+1}}\cdot\frac{k!\,w(I)}{w(v)}$