One of the most widely studied probabilistic models in statistical physics and machine learning is the Ising model, which is a probability distribution on the hypercube $\{\pm 1\}^n$ of the form
$$P(x) \propto \exp\Big(\sum_{i < j} J_{ij} x_i x_j\Big),$$
where the $J_{ij}$ are the entries of an arbitrary real, symmetric matrix $J$ with zeros on the diagonal. The distribution is also referred to as the Boltzmann distribution or Gibbs measure. The key quantity of interest is the normalizing constant
$$Z = \sum_{x \in \{\pm 1\}^n} \exp\Big(\sum_{i < j} J_{ij} x_i x_j\Big),$$
known as the partition function of the Ising model, and its logarithm, $\mathcal{F} = \log Z$, known as the free energy. The reason these are important is that one can easily extract from them many other quantities of interest, most notably the values of the marginals (probabilities like $\Pr[X_i = 1]$), phase transitions in the behavior of the distribution (e.g. the existence of long-range correlations), and many others.
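For intuition, the partition function and free energy can be computed by brute-force enumeration for very small $n$; the sketch below is purely illustrative (the function names are ours) and exponential in $n$:

```python
import itertools
import math

def free_energy(J):
    """Brute-force free energy log(Z) of an Ising model with interaction
    matrix J (symmetric, zero diagonal) on the hypercube {-1, +1}^n."""
    n = len(J)
    Z = 0.0
    for x in itertools.product([-1, 1], repeat=n):
        # Energy of a configuration: sum over pairs i < j of J_ij * x_i * x_j.
        energy = sum(J[i][j] * x[i] * x[j]
                     for i in range(n) for j in range(i + 1, n))
        Z += math.exp(energy)
    return math.log(Z)

# Two ferromagnetically coupled spins: Z = 2e^b + 2e^{-b}.
b = 0.7
F = free_energy([[0.0, b], [b, 0.0]])
```

Already at moderate $n$ this enumeration is infeasible, which is precisely why the approximation methods discussed below are needed.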
Although originally introduced in statistical physics, Ising models and their generalizations have also found a wide range of applications in many different areas like statistics, computer science, combinatorics and machine learning (see, e.g., the references and discussion in [Basak and Mukherjee, 2017, Borgs et al., 2012, Wainwright and Jordan, 2008]). Consequently, various different algorithmic and analytic approaches to computing and/or approximating the free energy have been developed.
We should note at the outset that the partition function is both analytically and computationally intractable: closed-form expressions for the partition function are extremely hard to derive (even for the Ising model on the standard $d$-dimensional lattice), and even crudely approximating the partition function multiplicatively is NP-hard, already in the case of graphs with degrees bounded by a small constant (see [Sly and Sun, 2012]).
Nevertheless, there is a plethora of approaches to approximating the partition function – both for the purposes of deriving structural results and for designing efficient algorithms. A major group of approaches consists of so-called variational methods, which proceed by writing a variational expression for the free energy and then modifying the resulting optimization problem in some way so as to make it tractable. More concretely, one can write the free energy using the Gibbs variational principle as
$$\mathcal{F} = \log Z = \max_{\mu} \Big\{ \mathbb{E}_{x \sim \mu}\Big[\sum_{i<j} J_{ij} x_i x_j\Big] + H(\mu) \Big\},$$
where $\mu$ ranges over all probability distributions on the Boolean hypercube. This can be seen by noting that
$$\mathbb{E}_{x \sim \mu}\Big[\sum_{i<j} J_{ij} x_i x_j\Big] + H(\mu) = \log Z - D_{\mathrm{KL}}(\mu \,\|\, P)$$
and recalling that $D_{\mathrm{KL}}(\mu \,\|\, P) \ge 0$, with equality if and only if $\mu = P$.
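For tiny instances, the Gibbs variational principle can be checked numerically by brute force; the sketch below (with illustrative helper names of our own) verifies that the Boltzmann distribution attains the supremum and any other distribution falls below it:

```python
import itertools
import math

def energy(J, x):
    n = len(J)
    return sum(J[i][j] * x[i] * x[j] for i in range(n) for j in range(i + 1, n))

def variational_objective(J, mu, states):
    """E_mu[energy] + H(mu): the quantity maximized in the Gibbs principle."""
    avg_energy = sum(p * energy(J, x) for p, x in zip(mu, states))
    H = -sum(p * math.log(p) for p in mu if p > 0)
    return avg_energy + H

J = [[0.0, 0.5, -0.3], [0.5, 0.0, 0.2], [-0.3, 0.2, 0.0]]
states = list(itertools.product([-1, 1], repeat=3))
weights = [math.exp(energy(J, x)) for x in states]
Z = sum(weights)
F = math.log(Z)
gibbs = [w / Z for w in weights]              # the Boltzmann distribution
uniform = [1.0 / len(states)] * len(states)   # any competing distribution
```

The objective at the Boltzmann distribution equals $\log Z$ exactly, while the uniform distribution (or any other) gives a strictly smaller value unless it coincides with the Boltzmann distribution.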
Of course, the polytope of distributions is intractable to optimize over. Two popular approaches for handling this are:
Mean-field approximation: instead of optimizing over all distributions, one optimizes over product distributions, thereby obtaining a lower bound on $\mathcal{F}$. In other words, we define the (mean-field) variational free energy $\mathcal{F}^*$ by
$$\mathcal{F}^* = \max_{x \in [-1,1]^n} \Big\{ \sum_{i<j} J_{ij} x_i x_j + \sum_{i=1}^n H\Big(\frac{1+x_i}{2}\Big) \Big\},$$
where $H(p) = -p \log p - (1-p)\log(1-p)$ is the binary entropy function.
Indeed, if $x$ is the optimizer in the above definition, then the product distribution $\nu$ on the Boolean hypercube, with the $i$-th coordinate having expectation $x_i$, minimizes $D_{\mathrm{KL}}(\nu \,\|\, P)$ among all product distributions $\nu$.
This approach originated in the physics literature where it was used to great success in several cases, but from the point of view of algorithms it is a priori problematic: it’s not clear this problem is any easier to solve, as the resulting optimization problem is highly non-convex.
Moment-based convex relaxations: instead of optimizing over distributions, one optimizes over a “relaxation” (enlarging) of the polytope of distributions, thereby obtaining an upper bound on . There are systematic ways to do this, giving rise to hierarchies of convex relaxations (see, e.g. [Barak et al., 2011]). This approach is very natural and common in theoretical computer science, since the optimization problem is convex, hence efficiently solvable, although quantifying the quality of the relaxation is usually more difficult.
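As a concrete illustration of the first approach, the mean-field objective over product distributions can be evaluated directly and, for tiny instances, crudely maximized by grid search; the sketch below (our own helper names; the grid search is purely illustrative and exponential in $n$) makes the non-convexity concrete:

```python
import itertools
import math

def binary_entropy(p):
    """Entropy (in nats) of a ±1-valued spin with P(spin = +1) = p."""
    return -sum(q * math.log(q) for q in (p, 1.0 - p) if q > 0)

def mean_field_objective(J, m):
    """Variational objective of the product distribution with E[x_i] = m[i]."""
    n = len(J)
    e = sum(J[i][j] * m[i] * m[j] for i in range(n) for j in range(i + 1, n))
    return e + sum(binary_entropy((1.0 + mi) / 2.0) for mi in m)

def mean_field_free_energy(J, grid=41):
    """Crude grid search over magnetization vectors in [-1, 1]^n (tiny n only)."""
    pts = [-1.0 + 2.0 * k / (grid - 1) for k in range(grid)]
    return max(mean_field_objective(J, m)
               for m in itertools.product(pts, repeat=len(J)))
```

The value returned is always a lower bound on the true free energy, as the first bullet point states.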
A priori, these two approaches seem unrelated – indeed, the ways they modify the variational problem are almost opposite. In this paper, we provide a unified perspective on these two approaches: for example, we show that the tight parameter regime where mean-field approximation and Sherali-Adams based approaches (even for classical MAX-$k$-CSP) give nontrivial guarantees is identical.
More precisely, we prove the following results.
Simple and optimal mean-field bounds via rounding: We obtain the optimal bounds on the quality of the mean-field approximation in a simple and elegant way. In particular, we show that there is a simple rounding procedure which directly extracts a product distribution from the true Gibbs measure, and whose output is easy to analyze. More precisely, a recent result due to [Jain et al., 2018a] proves that the mean-field approximation $\mathcal{F}^*$ is within an additive error of $O\big(n^{2/3} \|J\|_F^{2/3} \log^{1/3}(n \|J\|_F + e)\big)$ of $\mathcal{F}$ (here, $\|J\|_F$ is the Frobenius norm of the matrix $J$). We improve this and show:
Fix an Ising model $J$ on $n$ vertices. Then,
$$0 \le \mathcal{F} - \mathcal{F}^* = O\big(n^{2/3} \|J\|_F^{2/3}\big).$$
Subexponential algorithms for approximating up to the computational intractability limit: Our proof of the above theorem is algorithmic, except that it assumes access to the true Gibbs measure. To fix this, we instead apply our rounding scheme to a convex relaxation proposed by [Risteski, 2016] based on the Sherali-Adams hierarchy. The algorithm we get as a result runs in subexponential time so long as ; this condition for subexponentiality is tight under Gap-ETH. More precisely:
We can approximate up to an additive factor of in time if . Moreover, we can also output a product distribution achieving this approximation. On the other hand, for , it is Gap-ETH-hard to approximate up to an additive factor of in subexponential time.
We also describe how to accelerate the algorithm on dense graphs using random subsampling. The full results are in Section 7.
Optimality of correlation rounding: The rounding we use in the proof of the above theorems relies crucially on the correlation rounding technique introduced in [Barak et al., 2011]. This procedure was designed specifically to tackle dense and spectrally well-behaved instances of constraint satisfaction problems, as well as to derive subexponential algorithms for unique games. In order to better understand the efficacy of correlation rounding, Allen, O'Donnell, and Zhou [Allen and O'Donnell, 2015] introduced a conjecture on the number of variables one needs to condition on in an arbitrary distribution in order to guarantee that the remaining pairs of variables have average covariance at most $\varepsilon$. The current best result of [Raghavendra and Tan, 2012] gives a bound of $O(1/\varepsilon^2)$; [Allen and O'Donnell, 2015] conjectured that this can be decreased to $O(1/\varepsilon)$. We refute this conjecture in essentially the strongest possible sense. Namely, we show:
There exists an absolute constant , a sequence of pairs going to infinity, and a family of probability distributions (the SK spin glass) such that for any set with ,
We prove this theorem by combining our techniques with rigorous results on the Sherrington-Kirkpatrick spin glass. The full results are in Section 5.
Generalization of all results to $k$-MRFs: We give natural and tight generalizations of these results to order-$k$ Markov Random Fields. In general, we show that the tight regime for additive-error approximation is the same for both mean-field approximation and sub-exponential time algorithms (under Gap-ETH), and we show tightness of the higher-order correlation rounding guarantee. The full results are in Section 6.
2 Background and related work
2.1 The mean-field approximation
Owing to its simplicity, the mean field approximation has long been used in statistical physics (see [Parisi, 1988]
for a textbook treatment) and also in Bayesian statistics [Anderson and Peterson, 1987, Jordan et al., 1999, Wainwright and Jordan, 2008], where it is one of the prototypical examples of a variational method. It has the attractive property that it always gives a lower bound on the free energy.
The critical points of $\mathcal{F}^*$ have a fixed-point interpretation as the solutions to the mean-field equation, $x = \tanh(Jx)$ (with $\tanh$ applied coordinatewise). However, iterating this equation is known to converge to the mean-field solution only in high-temperature regimes such as Dobrushin uniqueness; as soon as we leave this regime, the iteration may fail to converge to the optimum even in simple models (Curie-Weiss) – see [Jain et al., 2018a]. We explain a connection between the mean-field equation and our approach in Section 4.1 that does not rely on any high-temperature assumption.
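A minimal sketch of the naive fixed-point iteration $m \leftarrow \tanh(Jm)$ on the Curie-Weiss model at high temperature, where it does converge (the specific couplings $J_{ij} = \beta/n$ are an illustrative choice of ours):

```python
import math

def mean_field_iterate(J, steps=200):
    """Naive fixed-point iteration m <- tanh(J m) for the mean-field equation."""
    n = len(J)
    m = [0.5] * n  # arbitrary starting point
    for _ in range(steps):
        m = [math.tanh(sum(J[i][j] * m[j] for j in range(n))) for i in range(n)]
    return m

# Curie-Weiss couplings J_ij = beta/n at high temperature (beta < 1):
# the map is a contraction, so the iteration converges to the optimum m = 0.
n, beta = 10, 0.5
J = [[0.0 if i == j else beta / n for j in range(n)] for i in range(n)]
m = mean_field_iterate(J)
```

At low temperature ($\beta > 1$ in this convention) the same iteration can oscillate or settle on a suboptimal fixed point, which is the failure mode referenced above.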
It is well known [Ellis and Newman, 1978] that the mean field approximation is very accurate for the Curie-Weiss model, which is the Ising model on the complete graph, at all temperatures. On the other hand, it is also known [Dembo and Montanari, 2010] that for very sparse graphs like trees of bounded arity, this is not the case.
In recent years, considerable effort has gone into bounding the error of the mean-field approximation on more general graphs; we refer the reader to [Basak and Mukherjee, 2017, Jain et al., 2018a] for a detailed discussion and comparison of results in this direction. If one only wishes to show that the mean-field approximation asymptotically gives the correct free energy density and does not care about the rate of convergence, then the breakthrough result is due to [Basak and Mukherjee, 2017], who provided an exponential improvement over previous work of [Borgs et al., 2012] to identify the regime where this happens.
Theorem 2.1 ([Basak and Mukherjee, 2017]).
Let $\{J_n\}$ be a sequence of Ising models indexed by the number of vertices $n$. If $\|J_n\|_F = o(\sqrt{n})$, then $\mathcal{F} - \mathcal{F}^* = o(n)$.
This result is tight – there are simple examples of models with $\|J_n\|_F = \Theta(\sqrt{n})$ for which $\mathcal{F} - \mathcal{F}^* = \Omega(n)$. On the other hand, if one also cares about the rate of convergence, then this result is not the best known. Here, improving on previous bounds of [Borgs et al., 2012], [Basak and Mukherjee, 2017], and [Eldan, 2016], it was shown by [Jain et al., 2018a] that:
Theorem 2.2 ([Jain et al., 2018a]).
Fix an Ising model $J$ on $n$ vertices. Then,
$$\mathcal{F} - \mathcal{F}^* = O\big(n^{2/3} \|J\|_F^{2/3} \log^{1/3}(n \|J\|_F + e)\big).$$
2.2 Algorithms for dense graphs
At first glance, the condition above
may seem a little odd. To demystify it, consider the anti-ferromagnetic Ising model corresponding to MAX-CUT on a graph with $m$ edges, which has a negative coupling $J_{ij}$ for each edge $\{i,j\}$. (The scaling here is chosen so that, at the MAX-CUT optimum, the two terms in (1) are of the same scale.) If $\rho$ is the optimum fraction of edges cut, then
so the requirement that is the same as requiring . In other words, our algorithms operate in the regime where the average degree is super-constant and the objective is to approximate MAX-CUT within factor . Thus, they can be viewed as free-energy generalizations of optimization problems on dense graphs.
We briefly survey relevant work on approximation algorithms for dense graphs. The main emphasis in the literature has been on the dense case $m = \Theta(n^2)$, for which PTASs have been developed, for instance the weak regularity lemma based algorithm of [Frieze and Kannan, 1999], the greedy algorithms of [Mathieu and Schudy, 2008], and the Sherali-Adams based approach of [de la Vega and Kenyon-Mathieu, 2007]. On the other hand, if $m = O(n^{2-\delta})$ for any $\delta > 0$, then no PTAS even for MAX-CUT is possible [de la Vega and Karpinski, 2006].
The work most relevant to ours is the improved analysis of the Sherali-Adams relaxation due to [Yoshida and Zhou, 2014], based on correlation rounding. Surprisingly, although there are many methods to approximate MAX-CUT in the dense regime mentioned above, to our knowledge none of the algorithms except for Sherali-Adams is guaranteed to give sub-exponential time algorithms all the way down to the tight regime; for example, the method of [Frieze and Kannan, 1999] is sub-exponential only in a more restrictive regime. The guarantee for Sherali-Adams in this regime is not explicitly stated in [Yoshida and Zhou, 2014] or anywhere else, as far as we are aware, but is straightforward to show even from the correlation rounding guarantee of [Raghavendra and Tan, 2012] (see Section 7). The correct generalization of this guarantee to MAX-$k$-CSP was essentially pointed out in [Fotakis et al., 2016], but once again, their algorithm misses the tight regime (achievable by Sherali-Adams) by poly-logarithmic factors. Our result recovers the tight regime in this setting as well, while also generalizing to the free energy (see Section 7).
For computing the free energy, the two most relevant works are [Risteski, 2016] and [Jain et al., 2018a]: the first work does not make any connection to mean-field approximation and proves a slightly weaker guarantee for Sherali-Adams than the current work; the second work uses a regularity based approach to compute the mean-field approximation, and gets similar guarantees to the algorithm of this work but misses the correct sub-exponential time regime by log factors.
2.3 Correlation rounding, and a refutation of the Allen-O’Donnell-Zhou conjecture
Let $X = (X_1, \ldots, X_n)$ be a collection of random variables. There are two possibilities for such a collection:
The average covariance of the collection, defined to be $\frac{1}{n^2} \sum_{i,j=1}^n \big|\operatorname{Cov}(X_i, X_j)\big|$, is small.
The average covariance of the collection is not small: in this case, we expect a random coordinate to contain significant information about many of the other random variables in the collection, so that we might intuitively conjecture that conditioning on the random variables $X_i$ for all $i$ in a 'small' random subset $S$ of $[n]$ makes the average covariance sufficiently small.
This intuition is indeed true, and has been quantitatively formalized in several works by the theoretical computer science community [Barak et al., 2011, Guruswami and Sinop, 2011, Raghavendra and Tan, 2012, Yoshida and Zhou, 2014]. We note that similar ideas have appeared independently in the statistical physics literature under the name of ‘pinning’; see e.g. [Ioffe and Velenik, 2000] and references therein, as well as in the recent work [Coja-Oghlan and Perkins, 2017].
Theorem 2.3 ([Raghavendra and Tan, 2012]).
Let $X_1, \ldots, X_n$ be a collection of $\{\pm 1\}$-valued random variables, and let $\varepsilon > 0$. Then, for some integer $k = O(1/\varepsilon^2)$, there exists a subset $S \subseteq [n]$ with $|S| \le k$ such that:
$$\frac{1}{n^2} \sum_{i,j=1}^n \mathbb{E}_{X_S}\big|\operatorname{Cov}(X_i, X_j \mid X_S)\big| \le \varepsilon.$$
The above theorem is at the heart of the so-called correlation rounding technique for the Sherali-Adams and SOS convex relaxation hierarchies, which has been used to provide state-of-the-art approximation algorithms for many classic NP-hard problems and their variants; we refer the reader to the references above for much more on this. As we will see below, it will also be key to our proof of Theorem 1.1.
Their motivation for this conjecture was twofold:
where $I(X_i; X_j \mid X_S)$ denotes the (conditional) mutual information between $X_i$ and $X_j$, and then using the standard inequality $|\operatorname{Cov}(X_i, X_j)| \le O\big(\sqrt{I(X_i; X_j)}\big)$, which follows from Pinsker's inequality; we will present a generalized version of this proof from [Manurangsi and Raghavendra, 2017, Yoshida and Zhou, 2014] later. Essentially, they conjectured that one could surmount the quadratic loss without passing through mutual information.
From a complexity-theoretic point of view, the best lower bounds on the computational complexity of dense MAX-CSP problems (such as [Ailon and Alon, 2007, Manurangsi and Raghavendra, 2017]) leave open the possibility that MAX-CUT on $n$ vertices can be computed to within additive error $\varepsilon n^2$ in time $n^{O(1/\varepsilon)}$, whereas the best known algorithms all require time at least $n^{\Omega(1/\varepsilon^2)}$. If Conjecture 2.4 were true, the running time of the Sherali-Adams based approach would have improved to time $n^{O(1/\varepsilon)}$ for error $\varepsilon n^2$ (which, for dense graphs, is close to matching the lower bound of [Manurangsi and Raghavendra, 2017]).
[Allen and O'Donnell, 2015] prove Conjecture 2.4 for the special case when the random variables are the leaves of a certain type of information flow tree known as the caterpillar graph. In addition, [Manurangsi and Raghavendra, 2017] showed a similar improvement for correlation rounding in the MAX-$k$-CSP problem, when promised that there exists an assignment satisfying all of the constraints. As described in the introduction, we use ideas from statistical physics to refute Conjecture 2.4 in essentially the strongest possible form, by showing that Theorem 2.3 does not hold with conditioning sets of size $o(1/\varepsilon^2)$ (Theorem 1.3).
3 Technical tools
3.1 Hierarchies of convex relaxations
Computing the free energy of an Ising model contains as a special case the problem MAX-QP/MAX-2CSP, because if we scale the interaction matrix by $\beta$ and let $\beta \to \infty$, then
$$\lim_{\beta \to \infty} \frac{1}{\beta} \log Z(\beta J) = \max_{x \in \{\pm 1\}^n} \sum_{i<j} J_{ij} x_i x_j.$$
As with many other problems in combinatorial optimization, this is a maximization problem over the Boolean hypercube, i.e. a problem of the form
$$\max_{x \in \{\pm 1\}^n} f(x).$$
These problems are often NP-hard to solve exactly, but convex hierarchies give a principled way to write down a natural family of convex relaxations which are efficiently solvable and give increasingly better approximations to the true value. First, one re-expresses the problem as an optimization problem over the convex polytope of probability distributions, using that
$$\max_{x \in \{\pm 1\}^n} f(x) = \max_{\mu} \mathbb{E}_{x \sim \mu}[f(x)],$$
where $\mu$ ranges over probability distributions on $\{\pm 1\}^n$; the advantage of this reformulation is that the objective is now linear in the variable $\mu$. Second, one relaxes to a larger convex set of pseudo-distributions which is more tractable to optimize over. The tightness of the relaxation is controlled by a parameter $r$ (known as the level or number of rounds of the hierarchy); as the parameter increases, the relaxation becomes tighter, with the level-$n$ relaxation corresponding to the original optimization problem.
Different hierarchies correspond to different choices of the space of pseudo-distributions; two of the most popular are the Sherali-Adams (SA) hierarchy and the Sum-of-Squares (SOS)/Lasserre hierarchy. In the Sherali-Adams hierarchy, we define a level-$r$ pseudodistribution to be given by the following variables and constraints:
For every $S \subseteq [n]$ with $|S| \le r$, a valid joint distribution $\mu_S$ over $\{\pm 1\}^S$.
Compatibility conditions, which require that for every $S, T \subseteq [n]$ with $|S|, |T| \le r$, the marginal distributions of $\mu_S$ and $\mu_T$ on $S \cap T$ agree.
Observe that, by linearity, this data defines a unique pseudo-expectation operator $\tilde{\mathbb{E}}$ from real polynomials of degree at most $r$ to $\mathbb{R}$. (This operator may behave very differently from a true expectation: for example, it is possible that $\tilde{\mathbb{E}}[p^2] < 0$ for some polynomial $p$. The SOS hierarchy is formed by additionally requiring $\tilde{\mathbb{E}}[p^2] \ge 0$ for all low-degree $p$.)
Let $\mathcal{D}_r$ denote the set of level-$r$ pseudodistributions on the hypercube. Then, for a polynomial objective $f$ of degree at most $r$, we can write down
$$\max_{\mu \in \mathcal{D}_r} \tilde{\mathbb{E}}_{\mu}[f]$$
as a linear program with $n^{O(r)}$ many variables and a number of constraints which is polynomial in the number of variables. By strong duality for linear programs, we can also think of the value of the level-$r$ relaxation as corresponding to the best upper bound on the optimum derivable in a limited proof system, which captures e.g. case analysis on sets of size at most $r$.
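The gap between pseudodistributions and true distributions can be seen already at level 2. The following sketch codes up a standard example for illustration: pairwise marginals on three spins that are locally consistent yet realized by no global distribution.

```python
import itertools

# Level-2 pseudodistribution on 3 spins: every pair perfectly anti-correlated.
pair = {(-1, 1): 0.5, (1, -1): 0.5}   # claimed joint law of each pair {i, j}
single = {-1: 0.5, 1: 0.5}            # claimed marginal law of each spin

def marginal_of_pair(pair_dist, coord):
    """One-variable marginal of a two-variable distribution."""
    m = {}
    for xy, p in pair_dist.items():
        m[xy[coord]] = m.get(xy[coord], 0.0) + p
    return m

# Compatibility: both one-variable marginals of every pair agree with `single`.
compatible = all(marginal_of_pair(pair, c) == single for c in (0, 1))

# Yet no true distribution matches: x0*x1 = x1*x2 = x0*x2 = -1 is unsatisfiable
# (the product of the three pairwise products is a square, hence +1).
feasible = any(all(x[i] * x[j] == -1 for i, j in [(0, 1), (1, 2), (0, 2)])
               for x in itertools.product([-1, 1], repeat=3))
```

Here `compatible` is true while `feasible` is false, so these marginals form a valid level-2 pseudodistribution that is not a genuine distribution.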
3.2 The correlation rounding theorem
As mentioned in the introduction, our proof of Theorem 1.1 will depend crucially on the correlation rounding theorem. Here, we present a general higher-order version of this theorem due to [Manurangsi and Raghavendra, 2017], building on previous work of [Raghavendra and Tan, 2012] and [Yoshida and Zhou, 2014].
The multivariate total correlation of a collection of random variables $X_1, \ldots, X_n$ is defined to be
$$C(X_1; \ldots; X_n) := D_{\mathrm{KL}}\big(p_{X_1, \ldots, X_n} \,\big\|\, p_{X_1} \otimes \cdots \otimes p_{X_n}\big),$$
the divergence of the joint law from the product of the marginals. From the definition of divergence, it follows that
$$C(X_1; \ldots; X_n) = \sum_{i=1}^n H(X_i) - H(X_1, \ldots, X_n).$$
By using conditional distributions/conditional entropies, we may define the conditional multivariate total correlation in the obvious way. Note that in the two-variable case, the total correlation is the same as the mutual information $I(X_1; X_2)$.
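The identity $C = \sum_i H(X_i) - H(X_1, \ldots, X_n)$ is easy to check numerically; a small sketch (helper names are ours):

```python
import itertools
import math

def entropy(dist):
    """Shannon entropy in nats of a distribution given as {outcome: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def total_correlation(joint, n):
    """C(X_1;...;X_n) = sum_i H(X_i) - H(joint), for a joint law on n-tuples."""
    marginals = []
    for i in range(n):
        m = {}
        for x, p in joint.items():
            m[x[i]] = m.get(x[i], 0.0) + p
        marginals.append(m)
    return sum(entropy(m) for m in marginals) - entropy(joint)

# Three perfectly correlated fair coins: C = 3*log(2) - log(2) = 2*log(2).
joint = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}
C = total_correlation(joint, 3)
```

For independent variables the total correlation is zero, matching the interpretation of $C$ as the divergence from the product of the marginals.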
Theorem 3.2 (Correlation rounding theorem, [Manurangsi and Raghavendra, 2017]).
Let $X_1, \ldots, X_n$ be a collection of $\{\pm 1\}$-valued random variables. Then, for any $k \le n$, there exists some $S \subseteq [n]$ with $|S| \le k$ such that:
The same conclusion holds for general random variables with the factor replaced by . Also, the guarantee holds for general level -pseudodistributions.
For the reader’s convenience, we provide a complete proof of this result in Appendix A, correcting certain errors which have been persistent in the literature.
3.3 The Sherrington-Kirkpatrick model and spin glass theory
The Sherrington-Kirkpatrick (SK) spin glass model was introduced in [Kirkpatrick and Sherrington, 1975] as a solvable model of disordered systems. The Gibbs measure of the SK spin glass on $n$ vertices (without external field) is a random probability distribution on $\{\pm 1\}^n$ given by:
$$P(x) = \frac{1}{Z} \exp\Big(\frac{\beta}{\sqrt{n}} \sum_{i<j} g_{ij} x_i x_j\Big),$$
where the $g_{ij}$ are i.i.d. standard Gaussians and $\beta > 0$ is a fixed parameter referred to as the inverse temperature. In [Kirkpatrick and Sherrington, 1975], a prediction, now known as the replica-symmetric prediction, was made for the limiting value of $\frac{1}{n} \log Z$ as $n \to \infty$. It was soon realized that this prediction could not be correct for all values of $\beta$; finding and understanding the correct prediction led physicists to the development of a sophisticated spin glass theory based upon the non-rigorous replica method ([Mézard et al., 1987]). In particular, physicists showed via the replica method that the SK spin glass exhibits two phases depending on the value of $\beta$:
Replica Symmetry (RS, $\beta \le 1$). This is the regime where the original prediction for the limiting value of $\frac{1}{n} \log Z$ is correct. Moreover, the Gibbs measure exhibits a number of unusual properties: for example, the marginal law on any small subset of the coordinates converges to a product distribution as $n \to \infty$ ([Talagrand, 2011a]).
(Full) Replica Symmetry Breaking (fRSB, $\beta > 1$). In this phase, the limit of $\frac{1}{n} \log Z$ does not have a simple closed form; however, there is a remarkable variational expression for the limiting value known as the Parisi formula. Moreover, the Gibbs measure is understood to shatter into exponentially many clusters with the geometry of an ultrametric space.
In the replica symmetric phase, the prediction for the limiting value of $\frac{1}{n} \log Z$ was rigorously confirmed by the work of [Aizenman et al., 1987]. Furthermore, they proved their result for general distributions of the couplings $g_{ij}$, giving what is known as a universality result.
Theorem 3.4 ([Aizenman et al., 1987]).
Let $0 < \beta < 1$. For the SK spin glass at inverse temperature $\beta$,
$$\frac{1}{n} \log Z \to \log 2 + \frac{\beta^2}{4} \quad \text{in probability}$$
as $n \to \infty$. Moreover, this also holds if the $g_{ij}$ are i.i.d. samples from any distribution with finite moments, mean $0$ and variance $1$.
This is the only result we will need from the spin glass literature, although much more is now rigorously known. For an account of more recent developments, including the proofs of the Parisi formula and ultrametricity conjecture, we refer the reader to the books [Panchenko, 2013, Talagrand, 2011a, Talagrand, 2011b].
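Theorem 3.4 can be probed numerically at small $n$ by brute force. The sketch below compares the free energy density of one SK sample to the replica-symmetric value $\log 2 + \beta^2/4$; finite-size effects and sample fluctuations mean only rough agreement is expected, and the function name is ours:

```python
import itertools
import math
import random

def sk_free_energy_density(n, beta, seed=0):
    """Brute-force free energy density of one SK sample with couplings
    (beta / sqrt(n)) * g_ij for i.i.d. standard Gaussian g_ij."""
    rng = random.Random(seed)
    g = {(i, j): rng.gauss(0.0, 1.0) for i in range(n) for j in range(i + 1, n)}
    scale = beta / math.sqrt(n)
    Z = 0.0
    for x in itertools.product([-1, 1], repeat=n):
        Z += math.exp(scale * sum(gij * x[i] * x[j] for (i, j), gij in g.items()))
    return math.log(Z) / n

beta = 0.5
f = sk_free_energy_density(12, beta)
rs_prediction = math.log(2) + beta ** 2 / 4  # replica-symmetric value, beta < 1
```

Note that $\frac{1}{n}\log Z \ge \log 2$ always holds, by Jensen's inequality applied to the uniform distribution (the average energy under the uniform measure is zero).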
4 Mean-field approximation via correlation rounding: proof of Theorem 1.1
First, we recall a couple of lemmas used in essentially all works on correlation rounding. Recall that for two probability distributions $P$ and $Q$ on the same finite space $\Omega$, the total variation distance between $P$ and $Q$ is defined by $d_{TV}(P, Q) = \frac{1}{2} \sum_{\omega \in \Omega} |P(\omega) - Q(\omega)|$.
Lemma 4.1 (Lemma 5.1, [Barak et al., 2011]).
Let $X$ and $Y$ be jointly distributed random variables valued in $\{\pm 1\}$. Let $p_X, p_Y$ denote the marginal distributions of $X$ and $Y$, and let $p_{XY}$ denote their joint distribution. Then,
$$|\operatorname{Cov}(X, Y)| \le 2\, d_{TV}(p_{XY},\, p_X \otimes p_Y).$$
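Concretely, for $\pm 1$-valued variables one has $|\operatorname{Cov}(X, Y)| \le 2\, d_{TV}(p_{XY}, p_X \otimes p_Y)$, since covariance is the difference of expectations of the bounded function $xy$ under the two laws. A quick numerical check of this inequality (helper names are ours):

```python
def tv(p, q):
    """Total variation distance between two finitely supported distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def cov(joint):
    """Covariance of a pair of real variables with joint law {(x, y): prob}."""
    ex = sum(p * x for (x, y), p in joint.items())
    ey = sum(p * y for (x, y), p in joint.items())
    exy = sum(p * x * y for (x, y), p in joint.items())
    return exy - ex * ey

def product_marginals(joint):
    """The product of the two marginal distributions of `joint`."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return {(x, y): px[x] * py[y] for x in px for y in py}

# A correlated pair of ±1-valued variables.
joint = {(1, 1): 0.45, (-1, -1): 0.45, (1, -1): 0.05, (-1, 1): 0.05}
```

For a product distribution both sides vanish, so the bound is only informative when $X$ and $Y$ are genuinely correlated.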
From this, one can observe the following consequence of correlation rounding:
Let $X_1, \ldots, X_n$ be a collection of $\{\pm 1\}$-valued random variables. Then, for any $k \le n$, there exists some $S \subseteq [n]$ with $|S| \le k$ such that:
This is standard, and we include the proof for completeness. We begin by applying Theorem 3.2 with the given $k$; let $S$ denote the resulting set of size at most $k$. By Pinsker's inequality, we have
$$\big|\operatorname{Cov}(X_i, X_j \mid X_S)\big| \le \sqrt{2\, I(X_i; X_j \mid X_S)}$$
for any $i, j$. Therefore, by taking the expectation on both sides, we get:
By averaging over the choice of $i, j$, we get
from which we obtain our desired conclusion:
Finally, we recall the maximum-entropy principle characterizing product distributions:
Let $p$ denote a probability distribution on the finite product space $\Omega_1 \times \cdots \times \Omega_n$. Let $\bar{p}$ denote the product distribution on $\Omega_1 \times \cdots \times \Omega_n$ whose marginal distribution on $\Omega_i$ is the same as that of $p$ for all $i$. Then, $H(\bar{p}) \ge H(p)$.
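The maximum-entropy principle is easy to verify numerically on a small example; a sketch (helper names are ours):

```python
import math

def entropy(dist):
    """Shannon entropy in nats of a distribution given as {outcome: prob}."""
    return -sum(p * math.log(p) for p in dist.values() if p > 0)

def product_of_marginals(joint, n):
    """The product distribution with the same one-coordinate marginals as `joint`."""
    marginals = [{} for _ in range(n)]
    for x, p in joint.items():
        for i in range(n):
            marginals[i][x[i]] = marginals[i].get(x[i], 0.0) + p
    prod = {}
    def build(prefix, weight, i):
        if i == n:
            prod[tuple(prefix)] = weight
            return
        for v, q in marginals[i].items():
            build(prefix + [v], weight * q, i + 1)
    build([], 1.0, 0)
    return prod

# A correlated pair: the product of its marginals has strictly larger entropy.
joint = {(0, 0): 0.4, (1, 1): 0.4, (0, 1): 0.1, (1, 0): 0.1}
bar = product_of_marginals(joint, 2)
```

Equality holds exactly when the original distribution is itself a product distribution, since the entropy gap $H(\bar{p}) - H(p)$ equals the total correlation defined in Section 3.2.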
We are now ready to prove Theorem 1.1.
Proof of Theorem 1.1.
Let $k$ be some parameter which will be optimized later. We begin by applying Lemma 4.2 with this $k$ (for clarity of exposition, we will omit floors and ceilings since they do not make any essential difference); let $S$ denote the resulting set of size at most $k$. Let $P$ denote the Boltzmann distribution, and recall that the Gibbs variational principle Eq. 1 states that
Let denote the product distribution on for which . Then, using the chain rule for entropy, we see that
To summarize, we have shown that
In particular, there exists some choice of , such that with , we have
Finally, by setting we get the desired conclusion:
For the choice of $k$ in the above proof to make sense, we require that $k \le n$, which translates to a corresponding condition on $\|J\|_F$. However, the above bound also holds otherwise, since in this case our error term exceeds the trivial upper bound on $\mathcal{F} - \mathcal{F}^*$, obtained by considering the product distribution supported at a single point.
4.1 Aside: correlation rounding and the mean-field equation
The above proof shows that the variational objective evaluated at the rounded product measure is close to $\mathcal{F}$. This shows indirectly, by considering the maximizer of the variational free energy, that there exists a product distribution, with marginals that are an exact solution to the mean-field equation, which is close to the Gibbs distribution in KL divergence. In this subsection, we show that the marginals output by correlation rounding are already an approximate solution to the mean-field equation, given slightly stronger assumptions on $J$. This will follow by showing that the variance of the local fields is greatly reduced by conditioning. We will not need this result henceforth, but this structural result may be of independent interest.
First, we show that the error in the mean-field equations is bounded by the variance of the local field:
Let $X$ be the spins of an Ising model with interaction matrix $J$. Then for any $i$,
$$\Big|\mathbb{E}[X_i] - \tanh\big((J\,\mathbb{E}[X])_i\big)\Big| \le \sqrt{\operatorname{Var}\big((JX)_i\big)}.$$
Since the conditional law of $X_i$ given the remaining spins depends only on the local field $(JX)_i$, we know that $\mathbb{E}[X_i \mid X_{\sim i}] = \tanh\big((JX)_i\big)$. Therefore,
$$\Big|\mathbb{E}[X_i] - \tanh\big((J\,\mathbb{E}[X])_i\big)\Big| \le \mathbb{E}\Big|\tanh\big((JX)_i\big) - \tanh\big((J\,\mathbb{E}[X])_i\big)\Big| \le \mathbb{E}\big|(JX)_i - (J\,\mathbb{E}[X])_i\big| \le \sqrt{\operatorname{Var}\big((JX)_i\big)}$$
by the triangle inequality, the Lipschitz property of $\tanh$, and Jensen's inequality. ∎
Second, we can bound the average variance of the local fields by the average covariance. Recall that the Schatten 4-norm of $J$ is given by $\|J\|_{S_4} = \operatorname{tr}(J^4)^{1/4}$.
Let $X_1, \ldots, X_n$ be arbitrary random variables, and suppose $J_1, \ldots, J_n$ are the rows of a symmetric matrix $J$ with zeros on the diagonal. Then
$$\sum_{i=1}^n \operatorname{Var}\big(\langle J_i, X \rangle\big) \le \|J\|_{S_4}^2 \Big(\sum_{j,k} \operatorname{Cov}(X_j, X_k)^2\Big)^{1/2}.$$
By expanding out the variance and applying the Cauchy-Schwarz inequality, we find that
$$\sum_{i=1}^n \operatorname{Var}\big(\langle J_i, X \rangle\big) = \operatorname{tr}\big(J \Sigma J\big) = \operatorname{tr}\big(J^2 \Sigma\big) \le \|J^2\|_F \|\Sigma\|_F,$$
where $\Sigma$ is the covariance matrix of $X$. Recalling the definition of the Schatten 4-norm, and noting that $\|J^2\|_F = \operatorname{tr}(J^4)^{1/2} = \|J\|_{S_4}^2$, gives the result. ∎
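The variance calculation here reduces to the matrix inequality $\operatorname{tr}(J \Sigma J) = \operatorname{tr}(J^2 \Sigma) \le \|J^2\|_F \|\Sigma\|_F$ for the covariance matrix $\Sigma$; a quick numerical sanity check in pure Python (the matrices and helper names are illustrative choices of ours):

```python
import math

def matmul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def fro(A):
    """Frobenius norm."""
    return math.sqrt(sum(v * v for row in A for v in row))

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

# A small symmetric J with zero diagonal, and a PSD covariance matrix Sigma.
J = [[0.0, 1.0, -0.5], [1.0, 0.0, 0.25], [-0.5, 0.25, 0.0]]
Sigma = [[1.0, 0.3, 0.0], [0.3, 1.0, -0.2], [0.0, -0.2, 1.0]]

# Total variance of the local fields: sum_i Var((JX)_i) = tr(J Sigma J).
total_var = trace(matmul(matmul(J, Sigma), J))
J2 = matmul(J, J)
bound = fro(J2) * fro(Sigma)  # equals ||J||_{S4}^2 * ||Sigma||_F
```

The identity $\operatorname{tr}(J \Sigma J) = \operatorname{tr}(J^2 \Sigma)$ uses the symmetry of $J$ and cyclicity of the trace, and the final step is Cauchy-Schwarz for the trace inner product.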
Finally, correlation rounding controls the average covariance, giving us our desired result – after conditioning, the marginals approximately satisfy the mean-field equation.
Let $X$ be the spins of an Ising model with interaction matrix $J$. Fix $k$ and let $S$ be the set given by Lemma 4.2. Let $x_i = \mathbb{E}[X_i \mid X_S]$. Then
The same proof shows the mean-field equation holds approximately in the presence of external field, after conditioning.
5 Correlation rounding is tight for spin glasses: proof of Theorem 1.3
We define the following universal constant, which we already know an upper bound on by Theorem 2.3:
If Conjecture 2.4 were true, then we would have that this constant is $0$ – indeed, the conjecture says that the expected conditional covariance decays like $O(1/k)$ in the size $k$ of the conditioning set, even for a random choice of the conditioning set. We will instead show an explicit positive lower bound on this constant, thereby disproving the conjecture.
We begin by proving a variant of Theorem 1.1, which gives a bound on the error of the mean-field approximation in terms of this constant.
Let be a sequence of Ising models indexed by the number of vertices. Let (resp. ) denote the free energy (resp. variational free energy) of . Suppose that . Then,
Let $\{k_n\}$ be a sequence of natural numbers going to infinity, which will be specified later; our choice will be such that $k_n \le n$ for all $n$. For the Ising model $J_n$, let
and let denote the minimum value i.e. the value of the objective corresponding to . By repeating the first part of the proof of Theorem 1.1, we get
As opposed to the proof of Theorem 1.1 where we used the Cauchy-Schwarz inequality, here we simply estimate the first term by
note that this holds for all sufficiently large $n$ by assumption. It follows that, for all sufficiently large $n$,
dividing both sides by $n$, taking the limit as $n \to \infty$, and using the hypotheses yields the desired conclusion. ∎
To complete the proof of Theorem 1.3, we will exhibit a sequence of Ising models for which the former quantity is finite and the latter is positive. Specifically, we will show that this is true for a 'typical' growing sequence of samples of the Rademacher SK spin glass. First, we need the following lemma.
Fix $0 < \beta < 1$. Let $\mathcal{F}_n$ denote the (random) free energy of the SK spin glass on $n$ vertices with parameter $\beta$, and let $\mathcal{F}_n^*$ denote its variational free energy. Then,