Approximating Partition Functions in Constant Time

11/05/2017 ∙ by Vishesh Jain, et al. ∙ MIT

We study approximations of the partition function of dense graphical models. Partition functions of graphical models play a fundamental role in statistical physics, in statistics and in machine learning. Two of the main methods for approximating the partition function are Markov Chain Monte Carlo and variational methods. An impressive body of work in mathematics, physics and theoretical computer science provides conditions under which Markov Chain Monte Carlo methods converge in polynomial time. These methods often lead to polynomial time approximation algorithms for the partition function in cases where the underlying model exhibits correlation decay. There are very few theoretical guarantees for the performance of variational methods. One exception is a recent result by Risteski (2016), who considered dense graphical models and showed that using variational methods, it is possible to find an O(ϵ n) additive approximation to the log partition function in time n^O(1/ϵ^2) even in a regime where correlation decay does not hold. We show that under essentially the same conditions, an O(ϵ n) additive approximation of the log partition function can be found in constant time, independent of n. In particular, our results cover dense Ising and Potts models as well as dense graphical models with k-wise interactions. They also apply to low threshold rank models. To the best of our knowledge, our results are the first to give a constant time approximation to log partition functions and the first to use the algorithmic regularity lemma for estimating partition functions. As an application of our results, we derive a constant time algorithm for approximating the magnetization of Ising and Potts models on dense graphs.


1 Introduction

One of the major algorithmic tasks in the areas of Markov Chain Monte Carlo, statistical inference and machine learning is approximating partition functions of graphical models. The partition function, or an approximation of it, is used in computing marginals and posteriors – two of the basic inference tasks in graphical models. Moreover, there is an intimate connection between computing the partition function and Markov Chain Monte Carlo methods. In particular, Jerrum and Sinclair [20] showed that it is possible to approximate the partition function for “self-reducible” models for which a rapidly mixing Markov chain exists. On the other hand, for such models, an approximation of the partition function results in a rapidly mixing chain. Some key results in the theory of MCMC provide conditions for the existence of rapidly mixing chains and therefore allow for efficient approximations of the partition function, e.g. [12, 13, 14] and follow-up work.

A different line of work is based on the fact that the measure associated to any graphical model can be characterized by a variational problem. Thus, approximating the partition function can be achieved by solving optimization problems (e.g. [4], [21], [22]). Until recently, there were very few general guarantees for the performance of variational methods. One exception is the results of Risteski [18], who showed that using variational methods, it is possible to find an O(ϵn) additive approximation to the log partition function of dense graphical models in time n^O(1/ϵ^2), even in an interesting regime where correlation decay does not hold. Here, Δn denotes the total weight of all interactions of the graph (Δ being the density parameter), so that we are guaranteed an a priori upper bound of O(Δn) on the log partition function.

Following Risteski [18], we consider dense graphical models and show that it is possible to find an O(ϵn) additive approximation to the log partition function in (randomized) constant time. Thus, our main result, stated informally, is:

Theorem 1.1.

For dense graphical models, it is possible to compute the log partition function with additive error O(ϵn) in time that depends only on ϵ and the density parameter.

We note that the approximation guarantee in Theorem 1.1 is rather weak. For most inference tasks, one is interested in a good approximation of the partition function itself, and our results (like those of [18]) only provide an approximation of the partition function within a multiplicative factor of exp(O(ϵn)). Such an approximation is not useful for most inference and marginalization tasks. We note however that

  • It is easy to see that in constant time, it is impossible to get a sub-exponential approximation of the partition function (see Theorem 2.12 for a formal statement).

  • There are features of graphical models that are reflected by the log partition function. In particular, we show in Theorem 2.11 how to utilize our results to obtain an ϵ-approximation for the magnetization per site for Ising models in constant time. Similar statements hold for other dense graphical models as well. The exact statement of Theorem 2.11 provides the magnetization for a model that has parameters close to the model we are interested in. We show that this is necessary, as in constant time it is impossible to break symmetry between phases.

We note that our results hold for very general Markov random fields. On the other hand, even in the case of ferromagnetic Ising models, only polynomial time guarantees have been previously provided (see [13] and recent work [3, 15]).

We also note that our results can be thought of as generalizing classical results on the approximability of Max-Cut and similar problems on dense graphs – in fact, the Max-Cut problem is just the special case of computing the log partition function of an antiferromagnetic Ising model (i.e. all non-zero entries of the interaction matrix are negative) on a graph with equal edge weights, in the limit as the inverse temperature tends to infinity (so that the entropy term becomes negligible). Note that in this case the log partition function is Ω(n) (consider choosing the spins according to independent Rademacher random variables), so that an O(ϵn)-additive approximation actually gives a PTAS. The first PTAS for Max-Cut on dense graphs was given by Arora, Karger and Karpinski [2] but ran in time n^O(1/ϵ^2); this was improved by Frieze and Kannan in [7] using their weak regularity lemma, and their runtime remains essentially the fastest known (our algorithm matches this runtime). More recent works [1, 17] have shown that the sample complexity of the problem – the number of entries of the adjacency matrix which need to be probed – is only polynomial in 1/ϵ, independent of n. In our case, determining the correct sample complexity remains an interesting open problem (see Section 2.4). It is also interesting to note that in the optimization case, algorithms based on regularity and algorithms based on convex programming hierarchies (see [5, 23]) have the same dependence on ϵ (but not on n) – our result and [18] also have the same dependence on ϵ, showing that this phenomenon extends to counting-type problems as well, and suggesting that it will be difficult to beat this runtime.

2 Overview of results

2.1 Notation and definitions

We will consider Ising models where the spins are ±1 valued. The interactions can be both ferromagnetic and anti-ferromagnetic and of varying strengths. For simplicity, we primarily restrict ourselves to the case where there are no external fields, though it will be clear that our methods extend to that case as well (Remark 2.8).

Definition 2.1.

An Ising model is a probability distribution on the discrete cube {−1, 1}^n of the form

P(σ) = (1/Z) exp( Σ_{i,j} J_{i,j} σ_i σ_j ),

where the collection (J_{i,j})_{1 ≤ i,j ≤ n} are the entries of an arbitrary real, symmetric n × n matrix J. The normalizing constant Z := Σ_{σ ∈ {−1,1}^n} exp( Σ_{i,j} J_{i,j} σ_i σ_j ) is called the partition function of the Ising model.
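As a quick sanity check of this definition, the following is a minimal brute-force computation of Z, a sketch under the convention written above (summation over all ordered pairs, no external field); it is feasible only for very small n and is purely illustrative.

import itertools
import math
import numpy as np

def partition_function(J):
    # brute-force Z = sum over sigma in {-1, 1}^n of exp(sigma^T J sigma); only for tiny n
    n = J.shape[0]
    return sum(math.exp(np.array(s) @ J @ np.array(s))
               for s in itertools.product((-1, 1), repeat=n))

# a small ferromagnetic example: J_ij = 0.1 for i != j, on n = 8 vertices
n = 8
J = 0.1 * (np.ones((n, n)) - np.eye(n))
print(math.log(partition_function(J)))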

We will often need to consider an n × n matrix J as an n^2-dimensional vector (exactly how we arrange the entries of J in a single vector will not matter). When we do so, we will denote the resulting vector by J̃ for clarity.

Definition 2.2.

For a vector v ∈ R^d and p ≥ 1, the ℓ_p norm of v, denoted by ||v||_p, is defined by ||v||_p := (Σ_i |v_i|^p)^{1/p}. We also define ||v||_∞ := max_i |v_i|.

By viewing a matrix J as a vector J̃, these vector norms give rise to matrix norms which we will use in the sequel. The cases p = 1, 2, ∞ will be of particular importance to us.

Another matrix norm that we will need arises from viewing an n × n matrix as a linear operator from (R^n, ||·||_∞) to (R^n, ||·||_1) via matrix multiplication.

Definition 2.3.

For an n × n matrix A, we define ||A||_{∞→1} := max_{||x||_∞ ≤ 1} ||Ax||_1.

Following [18], we will focus on Δ-dense Ising models.

Definition 2.4.

An Ising model is -dense if .

From a combinatorial perspective, Δ-dense models are a generalization of dense graph models. From a statistical physics perspective, they generalize complete graph models such as the Curie-Weiss model and the SK model [19].

Our main results will provide algorithms which run in constant time. In order to provide such guarantees on problems with unbounded input size, we will work under the usual assumptions on the computational model for sub-linear algorithms (as in e.g. [1, 8, 11]). Thus, we can probe any matrix entry J_{i,j} in O(1) time. Also note that, by standard Chernoff bounds, for any set S of vertices for which we can test membership in O(1) time, we can also estimate |S| to additive error ϵn w.h.p. in constant time using O(1/ϵ^2) samples. This approximation will always suffice for us; so, for the sake of clarity in exposition, we will henceforth ignore this technical detail and just assume that we have access to |S| (as in e.g. [8]).
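A minimal sketch of this sampling estimate (illustrative only; the function name and the Hoeffding-style sample size are our own choices, not from the paper):

import math
import random

def estimate_set_size(n, is_member, eps, delta):
    # estimate |S| for S a subset of {0, ..., n-1} to additive error eps*n with
    # probability at least 1 - delta, using only membership queries (Hoeffding bound)
    m = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))
    hits = sum(is_member(random.randrange(n)) for _ in range(m))
    return n * hits / m

# usage: estimate the number of even indices among {0, ..., n-1}
print(estimate_set_size(10 ** 6, lambda i: i % 2 == 0, eps=0.01, delta=0.01))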

2.2 Main results

Our main theorem for Δ-dense Ising models is:

Theorem 2.5.

There exists a universal constant such that for any , and for any -dense Ising model on nodes with there is an algorithm running in time which computes an additive approximation to with probability at least .

Our methods extend in a straightforward manner to higher order Markov random fields on general finite alphabets as well. We will review the definitions in Section 7.1. For simplicity, we only state the result for the case of binary -uniform Markov random fields.

Theorem 2.6.

Fix . There exists a universal constant such that for any , and for any -dense binary -uniform Markov random field on with there is an algorithm running in time which computes an additive approximation to with probability at least .

In the previous theorem, it is possible to improve the dependence on at the expense of introducing a factor of in the running time. We have:

Theorem 2.7.

Fix . There exists a universal constant such that for any , and for any -dense binary -uniform Markov random field on with there is an algorithm running in time which computes an additive approximation to with probability at least .

Remark 2.8.

Although for simplicity, we stated Theorem 2.6 and Theorem 2.7 only for -uniform -dense Markov random fields, it is immediately seen that the results extend to general -dense Markov random fields of order by simply applying Theorem 7.5 or Theorem 7.6 to each , . Again, see Section 7.1 for definitions. In particular, this directly handles dense Ising models with external fields.

Remark 2.9.

The error term in Theorem 2.5 has two terms and . If , then and therefore, . Similar statements hold for the other theorems as well. Thus, in interesting applications of the theorem, we may assume WLOG that the dominant term is (by dominant here, we mean it is at least a small fraction of the other term).

As in [18], we are also able to handle the case of Ising models whose interaction matrices have low threshold ranks, although not in constant time. Again, for simplicity, we only record the result for regular low threshold rank models. Definitions will be presented in Section 7.2. Following Remark 2.9, we assume .

Theorem 2.10.

There exists a universal constant such that for any , and for any regular Ising model on nodes with (here, t is the sum-of-squares rank of the model) and , there is an algorithm running in time which computes an additive approximation to .

2.3 Approximating the expected total magnetization

Given an Ising model J, one of the most fundamental questions one can ask about it is how many spins are +1 and how many are −1 in expectation. In the case of the ferromagnetic Ising model (J_{i,j} ≥ 0 for all i, j), this corresponds to how strongly the system is magnetized. Accordingly, we define the (expected total) magnetization of an Ising model by M := E[Σ_i σ_i], where the expectation is with respect to the Ising distribution on {−1, 1}^n. Perhaps surprisingly, our results easily show that for dense Ising models, one can obtain, in some sense, an approximation to the expected magnetization in constant time. More precisely, we have:

Theorem 2.11.

Consider a Δ-dense Ising model. Let B denote a value of the (uniform) external field, and let M(B) denote the expected total magnetization of the correspondingly perturbed model. Then, for any ϵ > 0, we can find in constant time (depending only on ϵ and Δ) an O(ϵn) additive approximation to M(B'), for some external field B' close to B.

Proof.

It is well known that one can express the moments of spin systems in terms of derivatives of the log partition function. In particular, for the Ising model J, consider the family of perturbed Ising models obtained by adding a uniform external field B, with partition function Z(B) := Σ_σ exp( Σ_{i,j} J_{i,j} σ_i σ_j + B Σ_i σ_i ), and write F(B) := log Z(B). Then, for any B, we have

F'(B) = E_B[ Σ_i σ_i ],

where E_B denotes the expectation with respect to the Ising distribution perturbed by the external field B. In particular, F'(0) equals the expected total magnetization of the Ising model we started out with. Moreover, since F''(B) = Var_B( Σ_i σ_i ) ≥ 0 (this may also be seen via Jensen's inequality), we see that F is convex in B; in particular, for any B and any δ > 0, we have

( F(B) − F(B − δ) ) / δ ≤ F'(B) ≤ ( F(B + δ) − F(B) ) / δ.

Finally,

  • By the mean value theorem, the LHS/RHS of the inequalities above are given by F'(B_1) = M(B_1) and F'(B_2) = M(B_2), where B_1 ∈ [B − δ, B] and B_2 ∈ [B, B + δ].

  • By Theorem 2.5 together with Remark 2.8 (which allows external fields), we can evaluate F(B − δ), F(B) and F(B + δ), and hence the LHS and RHS above, up to the desired error in constant time (a minimal sketch of this finite-difference step is given below). ∎
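The following is a minimal sketch of the finite-difference step, assuming a hypothetical routine log_partition_estimate(J, B, eps) that returns an additive approximation to F(B) as provided by Theorem 2.5 and Remark 2.8; the name and signature are illustrative only.

def magnetization_estimate(log_partition_estimate, J, B, delta, eps):
    # convexity sandwich: (F(B) - F(B - delta)) / delta <= F'(B) <= (F(B + delta) - F(B)) / delta
    F_minus = log_partition_estimate(J, B - delta, eps)
    F_mid = log_partition_estimate(J, B, eps)
    F_plus = log_partition_estimate(J, B + delta, eps)
    lower = (F_mid - F_minus) / delta   # equals M(B1) for some B1 in [B - delta, B]
    upper = (F_plus - F_mid) / delta    # equals M(B2) for some B2 in [B, B + delta]
    return 0.5 * (lower + upper)        # a value sandwiched between magnetizations at nearby fields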

We remark that:

  • Unfortunately, it is impossible to approximate in constant time the magnetization at the specified value of the external fields. For example, consider an Ising model on vertices, where for some large if and otherwise. Let if and if . We set all the other to except that we set , where is uniformly chosen in and is uniformly chosen in . Note that this is a dense Ising model as per our definition. Note also that on the nodes we have the Ising model on the complete graph with one (random) node having external field.

    It is easy to see that if , the magnetization is . The fact that is a large constant implies that conditioning on one vertex taking the value results in a dramatic change in magnetization on the vertices . In particular, the magnetization is of order if and is of order if . It thus follows that we need queries in order to determine the magnetization in this case. We note that this example corresponds to a phase transition – in particular, for every , if then for all values of and . See [6] for general references for the Ising model on the complete graph.

  • The results for computing the magnetization readily extend to other models. For example, for Potts models, we can compute for each color the expected number of nodes of that color (up to O(ϵn) error and for an ϵ-close external field). Similarly, it is easy to check that we can compute other statistics at this accuracy. For instance, for the Ising model, we can approximate if for some .

2.4 Tightness of our results

As mentioned in the introduction, our results are qualitatively tight. We now make this precise by proving the following information-theoretic lower bound on the accuracy of constant-time algorithms for additively approximating log Z:

Theorem 2.12.

Fix . For any (possibly randomized) algorithm which probes at most entries of before returning an estimate to , there exists a -dense input instance such that makes error at least with probability at least .

Proof.

We prove the claim by reduction to a hypothesis testing problem. Specifically, we show that there exist two different dense Ising models and with log-partition functions that are at least -far apart (where ) such that no algorithm which makes only probes can distinguish between the two with probability greater than . This immediately implies that for any algorithm to estimate and for at least one of the two inputs, must make error at least with probability at least when given this input — otherwise we could use the output of to distinguish the two models with better than probability by checking which the output is closer to.

Let be an instance size to be taken sufficiently large, and consider two -dense ferromagnetic Ising models defined as follows:

  • , for which the underlying graph is the complete graph on vertices, many of the edges are randomly selected to have weight , and the remaining many edges are assigned weight . Note that since and , this model is indeed -dense for sufficiently large.

  • , for which the underlying graph is the complete graph on vertices and all edges have weight .

We denote the partition functions of these models by and respectively. It is easily seen that , and that . Therefore, for sufficiently large, and , so .

Now, we show that no algorithm can distinguish between and with probability greater than with only probes. We fix a 50/50 split between and on our input to algorithm . Since the randomized algorithm can be viewed as a mixture over deterministic algorithms, there must exist a deterministic algorithm with success probability in distinguishing from at least as large as . Let be the first edge queried by , let be the next edge queried assuming , and define similarly (without loss of generality, the algorithm uses all of its available queries). Let be the event that are all equal to . Event always happens under , and when the input is we see for . Thus the total variation distance between the observed distribution under and is at most , so by the Neyman-Pearson Lemma we know fails with probability at least . Therefore for we see that fails with probability at least which proves the result. ∎
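For reference, the generic estimate being invoked in the last step can be written as follows, with q the number of probes, m the number of randomly reweighted edges in the first model, and N the total number of edges of the complete graph (these symbols are introduced here purely for illustration):

\[
d_{\mathrm{TV}}\bigl(\text{transcript under } E_1,\ \text{transcript under } E_2\bigr)
\;\le\; \Pr[\text{some probe hits a reweighted edge}] \;\le\; \frac{qm}{N},
\]

so any test based on q probes errs with probability at least (1 − qm/N)/2 under a uniform prior on the two models.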

Note that on the input instances given in this theorem, our algorithm will use at most queries to output an approximation for both models with probability . Finding the optimal dependence of the running time on the parameters remains an interesting open problem.

2.5 Outline of the techniques

To illustrate the main reason for the intractability of the (log) partition function of an Ising model, we consider the ferromagnetic case, where J_{i,j} ≥ 0 for all i, j. In this case, it is clear that a given magnetized state, where almost all of the spins are either +1 or −1, is more likely than a given unmagnetized state, where the spins are almost evenly split between +1 and −1. However, since the number of states with exactly k spins equal to +1 is simply the binomial coefficient (n choose k), we see that the total number of strongly magnetized states is exponentially smaller than the total number of unmagnetized states. Therefore, while any given unmagnetized state is less likely than any given magnetized state, it may very well be the case that the total probability of the system being in an unmagnetized state is greater than that of the system being in a magnetized state.
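This energy-entropy tradeoff can be made concrete on the Curie-Weiss model. The following minimal sketch (our own illustration, using the convention J_{i,j} = β/(2n) for all i, j, so that a configuration with k up-spins has energy β(2k − n)^2/(2n)) compares the exact log partition function with the log of the largest term of the sum over magnetization levels; in line with the classical Curie-Weiss phase transition, the dominant term is unmagnetized for β < 1 and magnetized for β > 1.

import math

def curie_weiss_logZ_exact(n, beta):
    # Z = sum over k of C(n, k) * exp(beta * (2k - n)^2 / (2n)), computed via a log-sum-exp
    terms = [math.log(math.comb(n, k)) + beta * (2 * k - n) ** 2 / (2 * n) for k in range(n + 1)]
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

def curie_weiss_logZ_maxterm(n, beta):
    # keep only the largest summand: off from the exact value by at most log(n + 1)
    return max(math.log(math.comb(n, k)) + beta * (2 * k - n) ** 2 / (2 * n) for k in range(n + 1))

for beta in (0.5, 1.5):
    n = 200
    print(beta, curie_weiss_logZ_exact(n, beta) - curie_weiss_logZ_maxterm(n, beta))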

In this paper, we present an approach to dealing with this tradeoff in dense Ising models (Theorem 2.5) based on the algorithmic regularity lemma of Frieze and Kannan (Theorem 3.3). Roughly speaking, this lemma allows us to efficiently partition the underlying (weighted) graph into a small number of blocks in a manner such that “cut-like” quantities associated to the graph approximately depend only on the numbers of edges between various blocks. In Lemma 4.1, we show that the log partition function fits into this framework. A similar statement for Max-Cut (which may be viewed as the limiting case of our model when only antiferromagnetic interactions are allowed and the inverse temperature tends to infinity) was first obtained in [7].

The key point in the regularity lemma is that the number of blocks depends only on the desired quality of approximation, and not on the size of the underlying graph. Together with the previous statement, this shows (Lemma 4.2) that one can approximately rewrite the sum computing the partition function in terms of only polynomially many nonnegative summands, as opposed to the exponentially many nonnegative summands we started out with. This provides a way to resolve the entropy-energy tradeoff – the log of the largest summand of this much smaller sum approximates the log of the partition function well (Lemma 5.1).
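The elementary bound behind this last step is the following generic fact about nonnegative sums, stated here for reference:

\[
\max_i a_i \;\le\; \sum_{i=1}^{N} a_i \;\le\; N \max_i a_i
\qquad\Longrightarrow\qquad
\log \max_i a_i \;\le\; \log \sum_{i=1}^{N} a_i \;\le\; \log \max_i a_i + \log N,
\]

valid for any nonnegative a_1, …, a_N; with N polynomial in n, the additive slack log N = O(log n) is negligible compared to an O(ϵn) target accuracy.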

The preceding discussion reduces the problem of estimating the log partition function to an optimization problem, although a very different one from [18]. However, as stated, it is not a problem we can solve in constant time. In Lemma 5.2, we show how to “granulate” the parameters to reduce the problem to constant size, and then our Algorithm 1 solves this problem efficiently via convex programming. The proofs of Theorem 2.6, Theorem 2.7 and Theorem 2.10 follow a similar outline, with the application of Theorem 3.3 replaced by Theorem 7.5, Theorem 7.6 and Theorem 7.9 respectively.

3 Preliminaries

In the case of the dense Ising model, our reduction to a problem that can be solved in constant time will be based on the algorithmic weak regularity lemma of Frieze and Kannan [8]. Before stating it, we introduce some terminology. Throughout this section, we will deal with m × n matrices whose entries we will index by pairs (i, j) ∈ [m] × [n], where [k] := {1, …, k}.

Definition 3.1.

Given S ⊆ [m], T ⊆ [n] and d ∈ R, we define the Cut Matrix CUT(S, T; d) to be the m × n matrix with entries

CUT(S, T; d)_{i,j} = d if (i, j) ∈ S × T, and CUT(S, T; d)_{i,j} = 0 otherwise.

Definition 3.2.

A Cut Decomposition expresses a matrix A ∈ R^{m×n} as

A = D^(1) + ⋯ + D^(s) + W,

where D^(t) = CUT(R_t, C_t; d_t) for all t = 1, …, s. We say that such a cut decomposition has width s, coefficient length (d_1^2 + ⋯ + d_s^2)^{1/2} and error ||W||_{∞→1}.
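A minimal sketch of these definitions in code (illustrative only: the decomposition below is chosen by hand rather than by the Frieze-Kannan algorithm, and the error is measured in the ||·||_{∞→1} norm of Definition 2.3, computed by brute force for a tiny matrix):

import itertools
import numpy as np

def cut_matrix(m, n, S, T, d):
    # CUT(S, T; d): equal to d on S x T and zero elsewhere (Definition 3.1)
    D = np.zeros((m, n))
    D[np.ix_(sorted(S), sorted(T))] = d
    return D

def inf_to_one_norm(W):
    # brute-force ||W||_{inf -> 1} = max over x in {-1,1}^m, y in {-1,1}^n of x^T W y
    best = 0.0
    for x in itertools.product((-1, 1), repeat=W.shape[0]):
        v = np.array(x) @ W
        best = max(best, np.abs(v).sum())  # the optimal y matches the sign of v coordinatewise
    return best

# a width-2 cut decomposition A = D1 + D2 + W of a random 6 x 6 matrix, and its error
A = np.random.randn(6, 6)
D1 = cut_matrix(6, 6, {0, 1, 2}, {3, 4, 5}, A[:3, 3:].mean())
D2 = cut_matrix(6, 6, {3, 4, 5}, {0, 1, 2}, A[3:, :3].mean())
W = A - D1 - D2
print("width 2, coefficient length", np.hypot(D1[0, 3], D2[3, 0]), ", error", inf_to_one_norm(W))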

We are now ready to state the algorithmic weak regularity lemma of Frieze and Kannan.

Theorem 3.3.

[8] Let A be an arbitrary real m × n matrix, and let ϵ > 0. Then, in time depending only on ϵ and the failure probability δ (given probe access to A as in Section 2.1), we can, with probability at least 1 − δ, find (an implicit representation of) a cut decomposition of width O(ϵ^{-2}), coefficient length at most O(||Ã||_2 / √(mn)) and error at most ϵ √(mn) ||Ã||_2.

4 Reducing the number of summands using weak regularity

The next lemma shows that for the purpose of additively approximating the log partition function, we may as well work with the matrix D := D^(1) + ⋯ + D^(s) coming from a cut decomposition of J. We define Z_D := Σ_{σ ∈ {−1,1}^n} exp( Σ_{i,j} D_{i,j} σ_i σ_j ).

Lemma 4.1.

Let J be the matrix of interaction strengths of a Δ-dense Ising model. For ϵ > 0, let J = D^(1) + ⋯ + D^(s) + W be a cut decomposition of J as in Theorem 3.3. Then, for Z and Z_D as above, we have |log Z − log Z_D| ≤ ||W||_{∞→1}.

Proof.

Note that for any σ ∈ {−1, 1}^n, we have |σ^T J σ − σ^T D σ| = |σ^T W σ| ≤ ||W||_{∞→1}. Therefore, for any σ, we have

exp( σ^T D σ − ||W||_{∞→1} ) ≤ exp( σ^T J σ ) ≤ exp( σ^T D σ + ||W||_{∞→1} ).

Taking first the sum of these inequalities over all σ and then the log, we get

log Z_D − ||W||_{∞→1} ≤ log Z ≤ log Z_D + ||W||_{∞→1}.

Finally, noting that Z = Σ_σ exp(σ^T J σ) and Z_D = Σ_σ exp(σ^T D σ), the proof follows. ∎

Recall that for each t ∈ [s], D^(t) = CUT(R_t, C_t; d_t) is a cut matrix between vertex sets R_t and C_t. We pass to a common refinement of all of the R_t's and C_t's. This gives at most k ≤ 2^{2s} disjoint sets V_1, …, V_k such that every one of the R_t's and C_t's is a union of these “atoms” V_a. In terms of these V_a's, we can define another approximation to the partition function:

Z̃ := Σ_c exp( Σ_{a=1}^{k} |V_a| · H(c_a / |V_a|) + Σ_{t=1}^{s} d_t · net(R_t, c) · net(C_t, c) ),

where c = (c_1, …, c_k) ranges over all elements of {0, 1, …, |V_1|} × ⋯ × {0, 1, …, |V_k|}, net(S, c) := Σ_{a : V_a ⊆ S} (2 c_a − |V_a|), and H is the binary entropy function. Note that 2 c_a − |V_a| represents the net spin in V_a, and similarly net(R_t, c) and net(C_t, c) represent the net spins in R_t and C_t. The next lemma shows that for approximating log Z_D, it suffices to obtain an approximation to log Z̃. Combined with Lemma 4.1, we therefore see that approximating log Z̃ is sufficient for approximating log Z.
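As a concrete, purely illustrative check of this regrouping: for a small blockwise-constant matrix D, the sum over spin configurations can be collected exactly by the per-atom up-spin counts, with binomial coefficients as multiplicities; replacing each binomial coefficient by the corresponding entropy term exp(|V_a| H(c_a/|V_a|)) then gives the Z̃ above. The block structure and weights below are made up for the example.

import itertools
import math
import numpy as np

# two atoms V1 = {0,1,2}, V2 = {3,4,5} and a blockwise-constant D (a sum of two cut matrices)
n, d = 6, 0.3
V1, V2 = [0, 1, 2], [3, 4, 5]
D = np.zeros((n, n))
D[np.ix_(V1, V2)] = d
D[np.ix_(V2, V1)] = d

# exact log Z_D by brute force over all 2^n spin configurations
logZ_exact = math.log(sum(math.exp(np.array(s) @ D @ np.array(s))
                          for s in itertools.product((-1, 1), repeat=n)))

# reduced sum: group configurations by the numbers (c1, c2) of up-spins in each atom;
# the energy depends only on the net spins, and the multiplicity is a product of binomials
total = 0.0
for c1 in range(4):
    for c2 in range(4):
        net1, net2 = 2 * c1 - 3, 2 * c2 - 3
        energy = 2 * d * net1 * net2
        total += math.comb(3, c1) * math.comb(3, c2) * math.exp(energy)
print(logZ_exact, math.log(total))  # agree up to floating point error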

Lemma 4.2.

For Z_D and Z̃ as above, we have |log Z_D − log Z̃| = O(k log n).

Proof.

First, observe that . Thus, letting and , we see that

Re-expressing the summation in terms of the possible values that and can take, we get that

where ranges over all elements of , , and similarly for and . Next, since

it follows that . Finally, we apply Stirling’s formula

to get the desired result. ∎
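For reference, the Stirling-type estimate invoked here is the standard bound (stated with a generic error term; the constants may differ from those used in the proof):

\[
\frac{1}{m+1}\, e^{m H(c/m)} \;\le\; \binom{m}{c} \;\le\; e^{m H(c/m)},
\qquad H(x) := -x \ln x - (1 - x) \ln(1 - x),
\]

so exchanging each binomial coefficient for the corresponding entropy term changes the logarithm of every summand, and hence of the whole sum, by at most k log(n + 1).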

5 Approximating the reduced sum in constant time using convex programming

So far, we have seen how the problem of estimating the log partition function can be reduced to the problem of approximating log Z̃. The next simple lemma reduces the estimation of log Z̃ to an optimization problem.

Lemma 5.1.

Let Z̃_max denote the largest summand in the sum defining Z̃. Then,

log Z̃_max ≤ log Z̃ ≤ log Z̃_max + k log(n + 1).

Proof.

This follows immediately by noting that Z̃, which is a sum of at most (n + 1)^k many nonnegative summands, is at least as large as its largest summand, and no larger than (n + 1)^k times its largest summand. ∎

The following lemma shows that for estimating the contribution of the term corresponding to some vector , it suffices to know the components of up to some constant precision. This reduces the optimization problem to one of constant size.

Lemma 5.2.

Let denote the matrix of interaction strengths of a -dense Ising model, and let denote a cut decomposition as in Theorem 3.3. Then, given for and some such that , and for all , we get that .

Proof.

From Theorem 3.3, we know that for all , . Since , it follows that

which completes the proof. ∎

We now present Algorithm 1, which approximates the log of the largest summand in Z̃ by iterating over all possible proportions of up-spins in each block V_a. For each proportion, it approximately solves a convex program that maximizes the entropy of the configuration subject to (roughly) the prescribed up-spin proportions, with a granularity parameter to be specified.

The infeasibility of this program means that our “guess” for the proportion of up-spins in the various sets is not combinatorially possible (recall that there may be significant overlap between the various R_t's and C_t's). On the other hand, if this program is feasible, then we actually obtain an approximate maximum entropy configuration with roughly the prescribed number of up-spins in each of these sets. Calculating the contribution to Z̃ of such a configuration, and maximizing this contribution over all iterations, gives a good approximation¹ to the maximum summand of Z̃, and hence to log Z̃. This is the content of the next lemma. ¹A slightly better approximation to Z̃ can be found by estimating the discrete sum over each region by an integral (using Lemma 5.4 to bound the error), and approximating this integral via the hit-and-run random walk of [16]. However, this algorithm is slower than Algorithm 1, and it turns out the gain in accuracy is negligible for our application.

Let BEST ← −∞.
for each granulated vector of up-spin proportions (one proportion per atom V_a) do
     Either find an approximate solution to, or verify infeasibility of, the convex program described above (e.g. by the ellipsoid method [10]).
     if the program is feasible then
          Let ENT denote the approximate maximum value of the program.
          Compute the energy term determined by the prescribed proportions.
          Let VAL be the sum of this energy term and ENT.
          BEST ← max(BEST, VAL).
     end if
end for
return BEST
Algorithm 1 Convex programming method to estimate the log partition function
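A minimal Python sketch of the enumeration structure of Algorithm 1 (illustrative only: the inner convex program is abstracted into a hypothetical callback solve_entropy_program, and energy_of_counts stands for the energy term computed from the prescribed counts; neither name comes from the paper):

from itertools import product

def estimate_log_partition(atom_sizes, energy_of_counts, solve_entropy_program, grid):
    # outer loop of Algorithm 1: enumerate granulated up-spin proportions per atom and
    # keep the best feasible "energy + entropy" value (an estimate of the log of the
    # largest summand of the reduced sum, and hence of the log partition function)
    best = float("-inf")
    for proportions in product(grid, repeat=len(atom_sizes)):
        counts = [round(p * size) for p, size in zip(proportions, atom_sizes)]
        ent = solve_entropy_program(counts)   # returns None if the guess is infeasible
        if ent is None:
            continue
        best = max(best, energy_of_counts(counts) + ent)
    return best

# usage sketch: grid = [i / 10 for i in range(11)], with problem-specific callbacks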
Lemma 5.3.

The output of Algorithm 1 is an additive approximation to when .

Proof.

By Lemma