1 Introduction
One of the most widely studied probabilistic models in statistical physics and machine learning is the
Ising model, which is a probability distribution on the hypercube
of the formwhere are the entries of an arbitrary real, symmetric matrix with zeros on the diagonal. The distribution is also referred to as the Boltzmann distribution or Gibbs measure. The key quantity of interest is the normalizing constant
known as the partition function of the Ising model, and its logarithm, , known as the free energy. The reason these are important is that one can easily extract from them many other quantities of interest, most notably the values of the marginals (probabilities like
), phase transitions in the behavior of the distribution (e.g. existence of longrange correlations), and many others.
Although originally introduced in statistical physics, Ising models and their generalizations have also found a wide range of applications in many different areas like statistics, computer science, combinatorics and machine learning (see, e.g., the references and discussion in [Basak and Mukherjee, 2017, Borgs et al., 2012, Wainwright and Jordan, 2008]). Consequently, various different algorithmic and analytic approaches to computing and/or approximating the free energy have been developed.
We should note at the outset that the partition function is both analytically and computationally intractable: closed form expressions for the partition function are extremely hard to derive (even for the Ising model on the standard dimensional lattice), and even crudely approximating the partition function multiplicatively is NPhard, even in the case of graphs with degrees bounded by a small constant (see [Sly and Sun, 2012]).
Nevertheless, there are a plethora of approaches to approximating the partition function – both for the purposes of deriving structural results, and for designing efficient algorithms. A major group of approaches consist of socalled variational methods, which proceed by writing a variational expression for the free energy, and then modifying the resulting optimization problem in some way so as to make it tractable. More concretely, one can write the free energy using the Gibbs variational principle as
(1) 
where ranges over all probability distributions on the Boolean hypercube. This can be seen by noting that
(2) 
and recalling that with equality if and only if .
Of course, the polytope of distributions is intractable to optimize over. Two popular approaches for handling this are:

Meanfield approximation: instead of optimizing over all distributions, one optimizes over product distributions, thereby obtaining a lower bound on . In other words, we define the (meanfield) variational free energy by
Indeed, if is the optimizer in the above definition, the product distribution on the Boolean hypercube, with the coordinate having expectation minimizes among all product distributions .
This approach originated in the physics literature where it was used to great success in several cases, but from the point of view of algorithms it is a priori problematic: it’s not clear this problem is any easier to solve, as the resulting optimization problem is highly nonconvex.

Momentbased convex relaxations: instead of optimizing over distributions, one optimizes over a “relaxation” (enlarging) of the polytope of distributions, thereby obtaining an upper bound on . There are systematic ways to do this, giving rise to hierarchies of convex relaxations (see, e.g. [Barak et al., 2011]). This approach is very natural and common in theoretical computer science, since the optimization problem is convex, hence efficiently solvable, although quantifying the quality of the relaxation is usually more difficult.
A priori these two approaches seem unrelated – indeed, the way they modify the variational problem is almost opposite. In this paper, we provide a unified perspective on these two approaches: for example, we show that the tight parameter regime where meanfield approximation and SheraliAdams based approaches (even for classical MAXCSP) give nontrivial guarantees is identical.
More precisely, we prove the following results.

Simple and optimal meanfield bounds via rounding: We obtain the optimal bounds on the quality of the meanfield approximation in a simple and elegant way. In particular, we show that there is a simple rounding procedure which directly extracts a product distribution from the true Gibbs measure, and whose output is easy to analyze. More precisely, a recent result due to [Jain et al., 2018a] proves that the meanfield approximation to is within an additive error^{1}^{1}1Here, is the Frobenius norm of the matrix . of . We improve this and show:
Theorem 1.1.
Fix an Ising model on vertices. Then,
We note that [Jain et al., 2018a] prove this inequality is tight up to constants. This also recovers the result of [Basak and Mukherjee, 2017] which shows the error is when . The full results are in Section 4.

Subexponential algorithms for approximating up to the computational intractability limit: Our proof of the above theorem is algorithmic, except that it assumes access to the true Gibbs measure. To fix this, we instead apply our rounding scheme to a convex relaxation proposed by [Risteski, 2016] based on the SheraliAdams hierarchy. The algorithm we get as a result runs in subexponential time so long as ; this condition for subexponentiality is tight under GapETH. More precisely:
Theorem 1.2.
We can approximate up to an additive factor of in time if . Moreover, we can also output a product distribution achieving this approximation. On the other hand, for , it is GapETHhard to approximate up to an additive factor of in subexponential time.
We also describe how to accelerate the algorithm on dense graphs using random subsampling. The full results are in Section 7.

Optimality of correlation rounding: The rounding we use in the proof of the above theorems relies crucially on the correlation rounding technique introduced in [Barak et al., 2011]. This procedure was designed specifically to tackle dense and spectrally wellbehaved instances of constraint satisfaction problems, as well as to derive subexponential algorithms for unique games. In order to better understand the efficacy of correlation rounding, Allen, O’Donnell, and Zhou [Allen and O’Donnell, 2015] introduced a conjecture on the number of variables one needs to condition on in an arbitrary distribution, in order to guarantee that the remaining pairs of variables have average covariance at most . The current best result of [Raghavendra and Tan, 2012] gives a bound of ; [Allen and O’Donnell, 2015] conjectured that this can be decreased to . We refute this conjecture in essentially the strongest possible sense. Namely, we show:
Theorem 1.3.
There exists an absolute constant , a sequence of pairs going to infinity, and a family of probability distributions (the SK spin glass) such that for any set with ,
We prove this theorem by combining our techniques with rigorous results on the SherringtonKirkpatrick spin glass. The full results are in Section 5.

Generalization of all results to MRFs: We give natural and tight generalizations of these results to order Markov Random Fields. In general, we show that the tight regime for additive error for both meanfield and subexponential time algorithms (under GapETH) is , and show tightness of the higherorder correlation rounding guarantee. The full results are in Section 6.
2 Background and related work
2.1 The meanfield approximation
Owing to its simplicity, the mean field approximation has long been used in statistical physics (see [Parisi, 1988]
for a textbook treatment) and also in Bayesian statistics
[Anderson and Peterson, 1987, Jordan et al., 1999, Wainwright and Jordan, 2008], where it is one of the prototypical examples of a variational method. It has the attractive property that it always gives a lower bound for the free energy.The critical points of have a fixpoint interpretation as the solutions to the meanfield equation, . However, iterating this equation is known to converge to the meanfield solution only in hightemperature regimes such as Dobrushin uniqueness; as soon as we leave this regime, the iteration may fail to converge to the optimum even in simple models (CurieWeiss) – see [Jain et al., 2018a]. We explain a connection between the meanfield equation and our approach in Section 4.1 that does not rely on any hightemperature assumption.
It is well known [Ellis and Newman, 1978] that the mean field approximation is very accurate for the CurieWeiss model, which is the Ising model on the complete graph, at all temperatures. On the other hand, it is also known [Dembo and Montanari, 2010] that for very sparse graphs like trees of bounded arity, this is not the case.
In recent years, considerable effort has gone into bounding the error of the meanfield approximation on more general graphs; we refer the reader to [Basak and Mukherjee, 2017, Jain et al., 2018a] for a detailed discussion and comparison of results in this direction. If one only wishes to show that the meanfield approximation asymptotically gives the correct free energy density and does not care about the rate of convergence, then the breakthrough result is due to [Basak and Mukherjee, 2017], who provided an exponential improvement over previous work of [Borgs et al., 2012] to identify the regime where this happens.
Theorem 2.1 ([Basak and Mukherjee, 2017]).
Let be a sequence of Ising models indexed by the number of vertices. if , then .
This result is tight – there are simple examples of models with where . On the other hand, if one also cares about the rate of convergence, then this result is not the best known. Here, improving on previous bounds of [Borgs et al., 2012], [Basak and Mukherjee, 2017], and [Eldan, 2016], it was shown by [Jain et al., 2018a] that:
Theorem 2.2 ([Jain et al., 2018a]).
Fix an Ising model on vertices. Then,
As stated earlier, our first main result Theorem 1.1 removes the logarithmic term in Theorem 2.2, thereby completely subsuming both of the theorems stated above. A more general version of this theorem, valid for higherorder Markov random fields on arbitrary finite alphabets, is Theorem 6.2 below.
2.2 Algorithms for dense graphs
At first glance, the condition that
may seem a little odd. To demystify it, consider the antiferromagnetic Ising model corresponding
^{2}^{2}2The scaling here is chosen so that if the MAXCUT is edges with , then the two terms in (1) are of the same scale. to MAXCUT on a graph with edges which has for each . If is the optimum fraction of edges cut, then(3) 
so the requirement that is the same as requiring . In other words, our algorithms operate in the regime where the average degree is superconstant and the objective is to approximate MAXCUT within factor . Thus, they can be viewed as freeenergy generalizations of optimization problems on dense graphs.
We briefly survey relevant work on approximation algorithms for dense graphs. The main emphasis in the literature has been on the case when for which PTASs have been developed, for instance the weak regularity lemma based algorithm of [Frieze and Kannan, 1999], the greedy algorithms of [Mathieu and Schudy, 2008], and the SheraliAdams based approach of [de la Vega and KenyonMathieu, 2007]. On the other hand, if for any then no PTAS for even MAXCUT is possible [de la Vega and Karpinski, 2006].
The work most relevant to ours is the improved analysis of the SheraliAdams relaxation due to [Yoshida and Zhou, 2014] based on correlation rounding. Surprisingly, although there are many methods to approximate MAXCUT when as mentioned above, to our knowledge none of the algorithms except for SheraliAdams are guaranteed to give subexponential time algorithms down to ; for example, the method of [Frieze and Kannan, 1999] is only subexponential time for . The guarantee for SheraliAdams in this regime is not explicitly stated in [Yoshida and Zhou, 2014] or anywhere else, as far as we are aware, but is straightforward to show even from the correlation rounding guarantee of [Raghavendra and Tan, 2012] (see Section 7). The correct generalization of this guarantee for MAXCSP was essentially pointed out in [Fotakis et al., 2016] but once again, their algorithm misses the tight regime (achievable by SheraliAdams) by polylogarithmic factors. Our result recovers the tight regime (i.e. constraints) in this setting as well, while also generalizing to the freeenergy (see Section 7).
For computing the free energy, the two most relevant works are [Risteski, 2016] and [Jain et al., 2018a]: the first work does not make any connection to meanfield approximation and proves a slightly weaker guarantee for SheraliAdams than the current work; the second work uses a regularity based approach to compute the meanfield approximation, and gets similar guarantees to the algorithm of this work but misses the correct subexponential time regime by log factors.
2.3 Correlation rounding, and a refutation of the AllenO’DonnellZhou conjecture
Let
be a collection of jointly distributed random variables, each of which takes values in
. There are two possibilities for such a collection:
The average covariance of the collection, defined to be , is small.

The average covariance of the collection is not small: in this case, we expect a random coordinate to contain significant information about many of the other random variables in , so that we might intuitively conjecture that conditioning on the random variables for all in a ‘small’ random subset of makes the average covariance sufficiently small.
This intuition is indeed true, and has been quantitatively formalized in several works by the theoretical computer science community [Barak et al., 2011, Guruswami and Sinop, 2011, Raghavendra and Tan, 2012, Yoshida and Zhou, 2014]. We note that similar ideas have appeared independently in the statistical physics literature under the name of ‘pinning’; see e.g. [Ioffe and Velenik, 2000] and references therein, as well as in the recent work [CojaOghlan and Perkins, 2017].
Theorem 2.3 ([Raghavendra and Tan, 2012]).
Let be a collection of valued random variables, and let . Then, for some integer :
The above theorem is at the heart of the socalled correlation rounding technique for the SheraliAdams and SOS convex relaxation hierarchies, which has been used to provide stateoftheart approximation algorithms for many classic NPhard problems and their variants; we refer the reader to the references above for much more on this. As we will see below, it will also be key to our proof of Theorem 1.1.
Recently, it was conjectured by Allen, O’Donnell and Zhou [Allen and O’Donnell, 2015] that the upper bound on in Theorem 2.3 can be improved significantly. More precisely, they conjectured that:
Conjecture 2.4 (Conjecture A in [Allen and O’Donnell, 2015]).
Theorem 2.3 holds with .
Their motivation for this conjecture was twofold:

On a technical level, the proof of Theorem 2.3 in [Raghavendra and Tan, 2012] proceeds by first showing that for some integer
where denotes the mutual information between and , and then using the standard inequality ; we will present a generalized version of this proof from [Manurangsi and Raghavendra, 2017, Yoshida and Zhou, 2014] later. Essentially, they conjectured that one could surmount the quadratic loss without passing through mutual information.

From a complexitytheoretic point of view, the best lower bounds on the computational complexity of dense MAXCSP problems (such as [Ailon and Alon, 2007, Manurangsi and Raghavendra, 2017]) leave open the possibility that MAXCUT on vertices can be computed to within additive error in time , whereas the best known algorithms all require time at least . If creftype 2.4 were true, the running time of the SheraliAdams based approach would have improved to time for error (which, for dense graphs, is close to matching the lower bound of [Manurangsi and Raghavendra, 2017]).
[Allen and O’Donnell, 2015] prove creftype 2.4 for the special case when the random variables are the leaves of a certain type of information flow tree known as the caterpillar graph. In addition, [Manurangsi and Raghavendra, 2017] showed a similar improvement for correlation rounding in the MAX CSP problem, when promised that there exists an assignment satisfying all of the constraints. As described in the introduction, we use ideas from statistical physics to refute creftype 2.4 in essentially the strongest possible form by showing that Theorem 2.3 does not hold with (Theorem 1.3).
3 Technical tools
3.1 Hierarchies of convex relaxations
Computing the free energy of an Ising model has as a special case the problem MAXQP/MAX2CSP, because if we let then
(4) 
As with many other problems in combinatorial optimization, this is a maximization problems on the Boolean hypercube, i.e. as a problem of the form
These problems are often NPhard to solve exactly, but convex hierarchies give a principled way to write down a natural family of convex relaxations which are efficiently solvable and give increasingly better approximations to the true value. First, one reexpresses the problem as an optimization problem over the convex polytope of probability distributions using that
the advantage of this reformulation is that the objective is now linear in the variable . Second, one relaxes to a larger convex set of pseudodistributions which are more tractable to optimize over. The tightness of relaxation is controlled by a parameter (known as the level or number of rounds of the hierarchy); as the parameter increases, the relaxation becomes tighter with the level relaxation corresponding to the original optimization problem.
Different hierarchies correspond to different choices of the space of pseudodistributions; two of the most popular are the SheraliAdams (SA) hierarchy and the SumofSquares (SOS)/Laserre hierarchy. In the SheraliAdams hierarchy, we define a level pseudodistribution to be given by the following variables and constraints:

For every with , a valid joint distribution over .

Compatability conditions, which require that for every with and every with and , .
Observe that, by linearity, this data defines a unique pseudoexpectation operator^{3}^{3}3This operator may behave very differently from a true expectation. For example, it’s possible that for some . The SOS hierarchy is formed by additionally requiring for all lowdegree . from real polynomials of degree at most to .
Let denote the set of level pseudodistributions on the hypercube. Then for , we can write down
as a linear program with
many variables and a number of constraints which is polynomial in the number of variables. By strong duality for linear programs, we can also think of the value of the level relaxation as corresponding to the best upper bound derivable on in a limited proof system, which captures e.g. case analysis on sets of size at most .In addition to this standard setup, since the variational formulation for has an entropy term, we will need a proxy for it when we use the SheraliAdams hierarchy. The particular proxy we will use was introduced by [Risteski, 2016] – further details are in Section 7.
3.2 The correlation rounding theorem
As mentioned in the introduction, our proof of Theorem 1.1 will depend crucially on the correlation rounding theorem. Here, we present a general higherorder version of this theorem due to [Manurangsi and Raghavendra, 2017], building on previous work of [Raghavendra and Tan, 2012] and [Yoshida and Zhou, 2014].
Definition 3.1.
The multivariate total correlation of a collection of random variables is defined to be
From the definition of divergence, it follows that
By using conditional distributions/ conditional entropies, we may define the conditional multivariate total correlation in the obvious way. Note that in the twovariable case, the total correlation is the same as the mutual information .
Theorem 3.2 (Correlation rounding theorem, [Manurangsi and Raghavendra, 2017]).
Let be a collection of valued random variables. Then, for any , there exists some such that:
Remark 3.3.
The same conclusion holds for general random variables with the factor replaced by . Also, the guarantee holds for general level pseudodistributions.
For the reader’s convenience, we provide a complete proof of this result in Appendix A, correcting certain errors which have been persistent in the literature.
3.3 The SherringtonKirkpatrick model and spin glass theory
The SherringtonKirkpatrick (SK) spin glass model was introduced in [Kirkpatrick and Sherrington, 1975] as a solvable model of disordered systems. The Gibbs measure of the SK spin glass on vertices (without external field) is a random probability distribution on given by:
where are i.i.d. standard Gaussians and is a fixed parameter referred to as the inverse temperature. In [Kirkpatrick and Sherrington, 1975], a prediction, now known as the replicasymmetric prediction, was made for the limiting value of as . It was soon realized that this prediction could not be correct for all values of ; finding and understanding the correct prediction led physicists to the development of a sophisticated spin glass theory based upon the nonrigorous replica method ([Mézard et al., 1987]). In particular, physicists showed via the replica method that the SK spin glass exhibits two phases depending on the value of :

Replica Symmetry (RS, ). This is the regime where the original prediction for the limiting value of is correct. Moreover, the Gibbs measure exhibits a number of unusual properties: for example, the marginal law on any small subset of the coordinates converges to a product distribution as ([Talagrand, 2011a]).

(Full) Replica Symmetry Breaking (fRSB, ). In this phase, the limit of does not have a simple closed form; however, there is a remarkable variational expression for the limiting value known as the Parisi formula. Moreover, the Gibbs measure is understood to be shattered into exponentially many clusters with the geometry of an ultrametric space.
In the replica symmetric phase, the prediction for the limiting value of was rigorously confirmed by the work of [Aizenman et al., 1987]. Furthermore, they proved their result for general distributions of the , giving what is known as a universality result.
Theorem 3.4 ([Aizenman et al., 1987]).
Let . For the SK spin glass at inverse temperature ,
as . Moreover, this also holds if the are i.i.d. samples from any distribution with finite moments, mean
and variance
.This is the only result we will need from the spin glass literature, although much more is now rigorously known. For an account of more recent developments, including the proofs of the Parisi formula and ultrametricity conjecture, we refer the reader to the books [Panchenko, 2013, Talagrand, 2011a, Talagrand, 2011b].
4 Meanfield approximation via correlation rounding: proof of Theorem 1.1
First we recall a couple of lemmas which are essentially used in all works on correlation rounding. Recall that for two probability distributions and on the same finite space , the total variation distance between and is defined by .
Lemma 4.1 (Lemma 5.1, [Barak et al., 2011]).
Let and be jointly distributed random variables valued in . Let denote the marginal distributions of and , and let denote their joint distribution. Then,
From this, one can observe the following consequence of correlation rounding:
Lemma 4.2.
Let be a collection of valued random variables. Then, for any , there exists some with such that:
Proof.
This is standard and we include the proof for completeness. We begin by applying Theorem 3.2 with ; let denote the resulting set of size at most . By Pinsker’s inequality, we have
for any . Therefore, by taking the expectation on both sides, we get:
By averaging over the choice of , we get
where the second inequality follows by the choice of and Theorem 3.2. Finally, Lemma 4.1 shows that for any ,
from which we obtain our desired conclusion:
(5) 
∎
Finally, we recall the maximumentropy principle characterizing product distributions:
Lemma 4.3.
Let denote a probability distribution on the finite space . Let denote the product distribution on whose marginal distribution on is the same as that of for all . Then, .
Proof.
We are now ready to prove Theorem 1.1.
Proof of Theorem 1.1.
Let be some parameter which will be optimized later. We begin by applying Lemma 4.2 with (for clarity of exposition, we will omit floors and ceilings since they do not make any essential difference); let denote the resulting set of size at most . Let denote the Boltzmann distribution, and recall that the Gibbs variational principle Eq. 1 states that
Let denote the product distribution on for which . Then, using the chain rule for entropy, we see that
where in the fourth line, we have used that , and in the last line, we have used Lemma 4.3. From Lemma 4.2 and the CauchySchwarz inequality, it follows that
To summarize, we have shown that
In particular, there exists some choice of , such that with , we have
Finally, by setting we get the desired conclusion:
∎
Remark 4.4.
For the choice of in the above proof to make sense, we require that , which translates to . However, the above bound also holds if since in this case, our error term equals , whereas there is a trivial upper bound of on , obtained by considering the product distribution supported at the point .
4.1 Aside: correlation rounding and the meanfield equation
The above proof shows that for the product measure , is close to . This shows indirectly, by considering the maximizer of , that there exists a product distribution with marginals that are an exact solution to the meanfield equation which is close to the Gibbs distribution in distance. In this subsection, we show that the marginals output by correlation rounding are already an approximate solution to the meanfield equation, given slightly stronger assumptions on . This will follow by showing that the variance of the local fields is greatly reduced by conditioning. We will not need this result henceforth, but this structural result may be of independent interest.
First, we show that the error in the meanfield equations is bounded by the variance of the local field:
Lemma 4.5.
Let be the spins of an Ising model with interaction matrix . Then for any ,
Proof.
Since , we know that . Therefore,
by the triangle inequality, the Lipschitz property of , and Jensen’s inequality. ∎
Second, we can bound the average variance of the local fields by the average covariance. Recall that the Schatten 4norm of is given by .
Lemma 4.6.
Let be arbitrary random variables, and suppose are the rows of a symmetric matrix with zeros on the diagonal. Then
Proof.
By expanding out the variance and applying the CauchySchwarz inequality, we find that
Recalling the definition of the Schatten 4norm gives the result. ∎
Finally, correlation rounding controls the average covariance, giving us our desired result – after conditioning, the marginals approximately satisfy the meanfield equation.
Theorem 4.7.
Let be the spins of an Ising model with interaction matrix . Fix and let be the set given by Lemma 4.2. Let . Then
Proof.
The same proof shows the meanfield equation holds approximately in the presence of external field, after conditioning.
5 Correlation rounding is tight for spin glasses: proof of Theorem 1.3
We define the following universal constant, which we already know an upper bound on by Theorem 2.3:
If creftype 2.4 were true, then we would have – indeed, the conjecture says that the expected conditional covariance decays like , even for a random choice of the conditioning set . We will instead show an explicit positive lower bound on , thereby disproving the conjecture.
We begin by proving a variant of Theorem 1.1, which gives a bound on the error of the meanfield approximation in terms of the constant .
Lemma 5.1.
Let be a sequence of Ising models indexed by the number of vertices. Let (resp. ) denote the free energy (resp. variational free energy) of . Suppose that . Then,
Proof.
Let be a sequence of natural numbers going to infinity, which will be specified later; our choice will be such that for all . For the Ising model , let
and let denote the minimum value i.e. the value of the objective corresponding to . By repeating the first part of the proof of Theorem 1.1, we get
As opposed to the proof of Theorem 1.1 where we used the CauchySchwarz inequality, here we simply estimate the first term by
Finally, set
note that for all sufficiently large by assumption, along with the fact that . It follows that for all sufficiently large,
dividing both sides by , taking the as , and using yields the desired conclusion. ∎
To complete the proof of Theorem 1.3, we will exhibit a sequence of Ising models for which is finite and is positive. Specifically, we will show that this is true for a ‘typical’ growing sequence of the Rademacher SKspin glass. First, we need the following lemma.
Lemma 5.2.
Fix . Let denote the (random) free energy of the SK spin glass on vertices with parameter , and let denote its variational free energy. Then,
asymptotically almost surely (a.a.s) i.e. with probability going to as . This holds under the same universality regime as Theorem 3.4.
Proof.
We prove this by calculating and