1 Introduction
Graphical models are widely used to represent the statistical relations among a set of random variables (Lauritzen, 1996; MacKay, 2003). Nodes of the graph correspond to random variables and edges of the graph represent statistical interactions among the variables. The problems of inference and learning on graphical models arise in many practical applications. The problem of inference is to deduce certain statistical properties (such as marginal probabilities, modes, etc.) of a given set of random variables whose graphical model is known; it has wide applications in areas such as error-correcting codes and statistical physics. The problem of learning, on the other hand, is to deduce the graphical model of a set of random variables given statistics (possibly from samples) of those variables; it is widely encountered in areas such as biology and anthropology.
The Ising model, a class of binary-variable graphical models with pairwise interactions, has been studied by physicists as a simple model of order-disorder transitions in magnetic materials (Onsager, 1944). Remarkably, it was found that in the special case of an Ising model with zero-mean binary random variables and pairwise interactions defined on a planar graph, calculation of the partition function (which is closely tied to inference) is tractable, essentially reducing to calculation of a matrix determinant (Kac and Ward, 1952; Sherman, 1960; Kasteleyn, 1963; Fisher, 1966). These methods have been used in machine learning (Schraudolph and Kamenetsky, 2008; Globerson and Jaakkola, 2007). We address the problem of approximating a collection of binary random variables (given their pairwise marginal distributions) by a zero-mean planar Ising model. We also consider the related problem of selecting a non-zero-mean Ising model defined on an outerplanar graph (these models are also tractable, being essentially equivalent to a zero-field model on a related planar graph).
There has been a great deal of work on learning graphical models. Much of this work has focused on learning over the class of thin graphical models (Deshpande et al., 2001; Bach and Jordan, 2001; Karger and Srebro, 2001; Shahaf et al., 2009), for which inference is tractable by converting the model to a junction tree. The simplest case is learning tree models (treewidth-one graphs), for which it is tractable to find the best tree model by reduction to a max-weight spanning tree problem (Chow and Liu, 1968). However, the problem of finding the best bounded-treewidth model is NP-hard for treewidths greater than one (Karger and Srebro, 2001), and so heuristic methods are used to select the graph structure (Deshpande et al., 2001; Karger and Srebro, 2001). Another popular method is convex optimization of the log-likelihood penalized by a norm of the parameters of the graphical model so as to promote sparsity (Banerjee et al., 2008; Lee et al., 2006). To go beyond low-treewidth graphs, such methods either focus on Gaussian graphical models or adopt a tractable approximation of the likelihood. Other methods learn only the graph structure itself (Ravikumar et al., 2010; Abbeel et al., 2006) and are often able to demonstrate asymptotic correctness of this estimate under appropriate conditions.
In contrast to existing approaches, this paper explores planarity as an alternative restriction on the model class, both to make learning tractable and to offer a qualitatively different graph topology in which the number of edges learned is linear in the number of variables.
2 Preliminaries
In this section, we develop our notation and briefly review the necessary background theory.
2.1 Divergence and Likelihood
Suppose we want to calculate how well a probability distribution $Q$ approximates another probability distribution $P$ (on the same sample space $\mathcal{X}$). For any two probability distributions $P$ and $Q$ on some sample space $\mathcal{X}$, we denote by $D(P \| Q)$ the Kullback-Leibler divergence (or relative entropy) between $P$ and $Q$, defined as $D(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$. The log-likelihood function is defined as $L(Q) = \sum_{x} P(x) \log Q(x)$. The probability distribution in a family $\mathcal{F}$ that maximizes the log-likelihood of a probability distribution $P$ is called the maximum-likelihood estimate of $P$ in $\mathcal{F}$, and this is equivalent to the minimum-divergence projection of $P$ to $\mathcal{F}$, so that $\arg\max_{Q \in \mathcal{F}} L(Q) = \arg\min_{Q \in \mathcal{F}} D(P \| Q)$.
2.2 Graphical Models and the Ising Model
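These definitions can be made concrete with a small numerical sketch (the two distributions are arbitrary examples, not from the paper):

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(P || Q) for discrete distributions
    given as probability vectors over the same sample space."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def expected_log_likelihood(p, q):
    """E_P[log Q], the population log-likelihood of Q under data from P."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(q[mask])))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
# Maximizing E_P[log Q] over a family is the same as minimizing D(P || Q),
# since D(P || Q) = E_P[log P] - E_P[log Q].
gap = kl_divergence(p, q) - (expected_log_likelihood(p, p)
                             - expected_log_likelihood(p, q))
assert abs(gap) < 1e-12
```

The identity checked in the last lines is exactly why maximum-likelihood estimation and divergence projection coincide: the entropy term $E_P[\log P]$ does not depend on $Q$.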
We will be dealing with binary random variables throughout the paper. We write $P(x)$ to denote the probability distribution of a collection of random variables $x = (x_1, \dots, x_n)$ with each $x_i \in \{-1, +1\}$. Unless otherwise stated, we work with undirected graphs $G = (V, E)$ with vertex (or node) set $V = \{1, \dots, n\}$ and edges $E$. For vertices $i, j \in V$ we write $G + ij$ to denote the graph $(V, E \cup \{ij\})$. A pairwise graphical model is a probability distribution that is defined on a graph $G$ with vertices $V$ as

$P(x) = \frac{1}{Z} \prod_{i \in V} \psi_i(x_i) \prod_{ij \in E} \psi_{ij}(x_i, x_j)$ (1)

where $\psi_i$ and $\psi_{ij}$ are non-negative node and edge compatibility functions. For positive $\psi$'s, we may also represent $P$ as a Gibbs distribution with potentials $f_i = \log \psi_i$ and $f_{ij} = \log \psi_{ij}$.
Definition 1.
An Ising model on binary random variables $x \in \{-1, +1\}^n$ and graph $G = (V, E)$ is the probability distribution defined by

$P(x) = \frac{1}{Z(\theta)} \exp\Big( \sum_{i \in V} \theta_i x_i + \sum_{ij \in E} \theta_{ij} x_i x_j \Big)$

where $Z(\theta) = \sum_{x \in \{-1,+1\}^n} \exp\big( \sum_{i \in V} \theta_i x_i + \sum_{ij \in E} \theta_{ij} x_i x_j \big)$. The partition function $Z(\theta)$ serves to normalize the probability distribution.
Formally, this defines an exponential family (Barndorff-Nielsen, 1979; Wainwright and Jordan, 2008) based on sufficient statistics $x_i$ and $x_i x_j$, parameters $\theta_i$ and $\theta_{ij}$, and moment parameters $\mu_i = \mathbb{E}[x_i]$ and $\mu_{ij} = \mathbb{E}[x_i x_j]$. The function $\Phi(\theta) = \log Z(\theta)$ is a convex function of $\theta$ and has the moment generating properties $\partial \Phi / \partial \theta_i = \mu_i$ and $\partial \Phi / \partial \theta_{ij} = \mu_{ij}$. In fact, any pairwise graphical model among binary variables can be represented as an Ising model.
The moments can be computed from the marginals as $\mu_i = P(x_i = 1) - P(x_i = -1)$ and $\mu_{ij} = P(x_i = x_j) - P(x_i \neq x_j)$. Inversely, the marginals are computed by $P(x_i = s) = \frac{1}{2}(1 + s \mu_i)$ and $P(x_i = s, x_j = t) = \frac{1}{4}(1 + s \mu_i + t \mu_j + s t \mu_{ij})$ for $s, t \in \{-1, +1\}$.
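The definitions above can be illustrated with a brute-force sketch (exponential in $n$, so for tiny examples only; the parameter values are arbitrary) that evaluates the partition function, the moments, and the moment-to-marginal conversion:

```python
import itertools
import numpy as np

def ising_brute_force(n, theta_node, theta_edge):
    """Enumerate all 2^n spin configurations x in {-1,+1}^n of an Ising
    model and return (Z, node moments mu_i, edge moments mu_ij).
    theta_edge is a dict {(i, j): value}. Exponential in n: sketch only."""
    configs = np.array(list(itertools.product([-1, 1], repeat=n)))
    energy = configs @ np.asarray(theta_node)
    for (i, j), th in theta_edge.items():
        energy = energy + th * configs[:, i] * configs[:, j]
    weights = np.exp(energy)
    Z = float(weights.sum())
    probs = weights / Z
    mu_node = probs @ configs
    mu_edge = {ij: float(probs @ (configs[:, ij[0]] * configs[:, ij[1]]))
               for ij in theta_edge}
    return Z, mu_node, mu_edge

Z, mu, mu_e = ising_brute_force(3, [0.2, 0.0, -0.1], {(0, 1): 0.5, (1, 2): -0.3})
# Marginals recovered from moments: P(x_i = 1) = (1 + mu_i) / 2.
p1 = (1 + mu) / 2
```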
An Ising model is said to be zero-field if $\theta_i = 0$ for all $i \in V$. It is zero-mean if $\mu_i = 0$ (that is, $\mathbb{E}[x_i] = 0$) for all $i \in V$. The Ising model is zero-field if and only if it is zero-mean. Although the zero-field assumption appears very restrictive, a general Ising model can be represented as a zero-field model by adding one auxiliary variable node connected to every other node of the graph (Globerson and Jaakkola, 2007). The parameters and moments of the two models are then related as follows:
Proposition 1.
Consider the Ising model on $G = (V, E)$ with $V = \{1, \dots, n\}$, parameters $\theta_i$ and $\theta_{ij}$, moments $\mu_i$ and $\mu_{ij}$, and partition function $Z$. Let $G^* = (V^*, E^*)$ denote the extended graph based on nodes $V^* = \{0\} \cup V$ with edges $E^* = E \cup \{\{0, i\} : i \in V\}$. We define a zero-field Ising model on $G^*$ with parameters $\theta^*$, moments $\mu^*$ and partition function $Z^*$. If we set the parameters according to

$\theta^*_{0i} = \theta_i$ for $i \in V$, $\qquad \theta^*_{ij} = \theta_{ij}$ for $ij \in E$,

then $Z^* = 2Z$ and $\mu^*_{0i} = \mu_i$ for $i \in V$, $\mu^*_{ij} = \mu_{ij}$ for $ij \in E$.
Thus, inference on the corresponding zero-field Ising model on the extended graph is equivalent to inference on the (non-zero-field) Ising model defined on $G$. Proof given in the Supplement.
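The relations of Proposition 1 can be checked numerically by brute force; the following sketch uses arbitrary small graphs and parameters, with the parameter mapping $\theta^*_{0i} = \theta_i$, $\theta^*_{ij} = \theta_{ij}$ described above:

```python
import itertools
import numpy as np

def partition_and_moments(n, theta_node, theta_edge):
    """Brute-force Z and moments of an Ising model over x in {-1,+1}^n."""
    X = np.array(list(itertools.product([-1, 1], repeat=n)))
    E = X @ np.asarray(theta_node)
    for (i, j), th in theta_edge.items():
        E = E + th * X[:, i] * X[:, j]
    w = np.exp(E)
    Z = float(w.sum())
    p = w / Z
    mu_node = p @ X
    mu_edge = {ij: float(p @ (X[:, ij[0]] * X[:, ij[1]])) for ij in theta_edge}
    return Z, mu_node, mu_edge

rng = np.random.default_rng(0)
n = 3
h = rng.normal(size=n)                  # node parameters theta_i
J = {(0, 1): 0.4, (1, 2): -0.7}         # edge parameters theta_ij
Z, mu, mu_e = partition_and_moments(n, h, J)

# Extended zero-field model on nodes {0,...,n}: node potentials become
# couplings to the auxiliary node 0 (original nodes shifted up by one).
J_ext = {(0, i + 1): h[i] for i in range(n)}
J_ext.update({(i + 1, j + 1): th for (i, j), th in J.items()})
Z_ext, mu_ext, mu_e_ext = partition_and_moments(n + 1, np.zeros(n + 1), J_ext)

assert abs(Z_ext - 2 * Z) < 1e-9                                 # Z* = 2 Z
assert np.allclose([mu_e_ext[(0, i + 1)] for i in range(n)], mu)  # mu*_{0i} = mu_i
```

The extended model is zero-field, hence zero-mean; the original node means reappear as edge moments on the auxiliary edges.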
2.3 Inference for Planar Ising Models
A graph is planar if it may be embedded in the plane without any edge crossings. It is known that any planar graph can be embedded such that all edges are drawn as straight lines. The motivation for our paper is the following result on tractability of inference for the planar zerofield Ising model.
Theorem 1.
(Kac and Ward, 1952; Sherman, 1960; Loebl, 2010) Let $G = (V, E)$ be a planar graph with a specified straight-line embedding in the plane and let $\angle(uv, vw)$ denote the clockwise rotation between the directed edges $uv$ and $vw$. We define the matrix $\Lambda W$, indexed by the directed edges of the graph, as follows: $\Lambda$ is the diagonal matrix with $\Lambda_{uv, uv} = \tanh \theta_{uv}$ and

$W_{uv, v'w} = \exp\big( \tfrac{i}{2} \angle(uv, vw) \big)$ if $v' = v$ and $w \neq u$, and $W_{uv, v'w} = 0$ otherwise.

Then, the partition function of the zero-field planar Ising model is given by the Kac-Ward determinant formula:

$Z = 2^n \Big( \prod_{ij \in E} \cosh \theta_{ij} \Big) \sqrt{\det(I - \Lambda W)}$
Another related method for computing the Ising model partition function is based on counting perfect matchings of planar graphs (Kasteleyn, 1963; Fisher, 1966). Thus, calculating the partition function reduces to calculating the determinant of a matrix; using the generalized nested dissection algorithm to exploit the sparsity of this matrix, the complexity of these calculations is $O(n^{3/2})$ (Lipton et al., 1979; Lipton and Tarjan, 1979; Galluccio et al., 2000). Thus, inference of the zero-field planar Ising model is tractable and scales well with problem size.
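The Kac-Ward matrix construction is delicate, but the broader point, that planar zero-field partition functions admit exact closed-form evaluation, can be sanity-checked on the simplest planar case: for a cycle, commuting transfer matrices give $Z = \prod_e 2\cosh\theta_e + \prod_e 2\sinh\theta_e$. This standard identity is our illustration, not the Kac-Ward formula itself:

```python
import itertools
import math
import numpy as np

# Cycle of n spins with couplings theta[e] between spin e and spin (e+1) % n;
# zero field. Coupling values are arbitrary.
n = 6
rng = np.random.default_rng(2)
theta = rng.normal(size=n)

# Brute-force partition function (2^n terms).
Z_brute = 0.0
for x in itertools.product([-1, 1], repeat=n):
    Z_brute += math.exp(sum(theta[e] * x[e] * x[(e + 1) % n] for e in range(n)))

# Closed form: the per-edge transfer matrices share the eigenvectors
# (1, 1) and (1, -1) with eigenvalues 2*cosh(theta_e) and 2*sinh(theta_e),
# so the trace of their product factorizes.
Z_closed = float(np.prod(2 * np.cosh(theta)) + np.prod(2 * np.sinh(theta)))

assert abs(Z_brute - Z_closed) / Z_closed < 1e-9
```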
The gradient and Hessian of the log-partition function $\Phi(\theta) = \log Z(\theta)$ can also be calculated efficiently from the Kac-Ward determinant formula. Derivatives of $\Phi$ recover the moment parameters of the exponential family model as $\mu_{ij} = \partial \Phi / \partial \theta_{ij}$ (Barndorff-Nielsen, 1979; Wainwright and Jordan, 2008). Thus, inference of moments (and node and edge marginals) is tractable for the zero-field planar Ising model.
Proposition 2.
Let $A = I - \Lambda W$ and $C = A^{-1}$, where $\Lambda$ and $W$ are defined as in Theorem 1; let $\circ$ denote the element-wise product and let $R$ be the permutation matrix swapping the indices of the directed edges $uv$ and $vu$. Then, the gradient and Hessian of $\Phi(\theta)$ are obtained from the entries of $C$; in particular, the moments depend only on its diagonal.
Calculating the full matrix $C = (I - \Lambda W)^{-1}$ requires $O(n^3)$ calculations. However, to compute just the moments, only the diagonal elements of $C$ are needed; using the generalized nested dissection method, inference of the moments (edge-wise marginals) of the zero-field Ising model can then be achieved with complexity $O(n^{3/2})$. Computing the full Hessian is more expensive, requiring $O(n^3)$ calculations.
2.4 Inference for Outerplanar Graphical Models
We emphasize that the above calculations require both a planar graph and a zero-field Ising model. Using the graphical transformation of Proposition 1, the zero-field condition may be relaxed, but at the expense of adding an auxiliary node connected to all the other nodes. For a general planar graph $G$, the extended graph $G^*$ may not be planar and hence may not admit tractable inference calculations. However, for the subset of planar graphs where this transformation does preserve planarity, inference is still tractable.
Definition 2.
A graph $G$ is said to be outerplanar if there exists an embedding of $G$ in the plane in which all the nodes lie on the outer face.
In other words, the graph $G$ is outerplanar if the extended graph $G^*$ (defined by Proposition 1) is planar. Then, from Proposition 1 and Theorem 1 it follows that:
Proposition 3.
(Globerson and Jaakkola, 2007) The partition function and moments of any outerplanar Ising graphical model (not necessarily zero-field) can be calculated efficiently. Hence, inference is tractable for any binary-variable graphical model with pairwise interactions defined on an outerplanar graph.
This motivates the problem of learning outerplanar graphical models for a collection of (possibly non-zero-mean) binary random variables.
3 Learning Planar Ising Models
This section addresses the main goals of the paper, which are twofold:

Solving for the maximum-likelihood Ising model on a given planar graph to best approximate a collection of zero-mean binary random variables.

Selecting (heuristically) the planar graph that yields the best approximation.
We address these problems in the following two subsections. The solution of the first problem is an integral part of our approach to the second. Both solutions are easily adapted to the context of learning outerplanar graphical models of (possibly non-zero-mean) binary random variables.
3.1 Maximum-Likelihood Parameter Estimation
Maximum-likelihood estimation over an exponential family is a convex optimization problem based on the log-partition function $\Phi(\theta)$. In the case of the zero-field Ising model defined on a given planar graph, $\Phi(\theta)$ is tractable to compute via the matrix determinant described in Theorem 1. Thus, we obtain an unconstrained, tractable, convex optimization problem for the maximum-likelihood zero-field Ising model on the planar graph $G$ to best approximate a probability distribution $P$:

$\hat{\theta} = \arg\max_{\theta} \Big( \sum_{ij \in E} \theta_{ij} \hat{\mu}_{ij} - \Phi(\theta) \Big)$

Here, $\hat{\mu}_{ij} = \mathbb{E}_P[x_i x_j]$ for all edges $ij \in E$ and the matrix determinant defining $\Phi$ is as in Theorem 1. If $P$ represents the empirical distribution of a set of independent identically-distributed (iid) samples $\{x^{(s)}\}_{s=1}^{S}$, then the $\hat{\mu}_{ij}$ are the corresponding empirical moments $\hat{\mu}_{ij} = \frac{1}{S} \sum_{s} x_i^{(s)} x_j^{(s)}$.
Newton’s Method
We solve this unconstrained convex optimization problem using Newton's method with step-size chosen by backtracking line search (Boyd and Vandenberghe, 2004). This produces a sequence of estimates $\theta^{(k)}$ calculated as follows:

$\theta^{(k+1)} = \theta^{(k)} + \alpha_k \, \big[\nabla^2 \Phi(\theta^{(k)})\big]^{-1} \big( \hat{\mu} - \nabla \Phi(\theta^{(k)}) \big)$

where the gradient $\nabla \Phi$ and Hessian $\nabla^2 \Phi$ are calculated using Proposition 2 and $\alpha_k$ is a step-size parameter chosen by backtracking line search (see Boyd and Vandenberghe (2004): Chapter 9, Section 2 for details). The per-iteration complexity of this optimization is $O(n^3)$ using explicit computation of the Hessian at each iteration. This complexity can be offset somewhat by only recomputing the Hessian occasionally (reusing the same Hessian for a number of iterations), to take advantage of the fact that the gradient computation only requires $O(n^{3/2})$ calculations. As Newton's method has quadratic convergence, the number of iterations required to achieve a high-accuracy solution is typically 8-16 (essentially independent of problem size). We estimate the computational complexity of solving this convex optimization problem as roughly $O(n^3)$.
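As a concrete illustration of this fitting procedure, the following sketch runs Newton's method with backtracking (Armijo) line search on a tiny zero-field Ising model, using brute-force enumeration in place of the Kac-Ward determinant; the graph, parameter values and tolerances are illustrative assumptions:

```python
import itertools
import numpy as np

# Zero-field Ising model on a fixed 4-cycle; brute-force enumeration stands
# in for the Kac-Ward determinant, so this sketch only works for tiny n.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n = 4
X = np.array(list(itertools.product([-1, 1], repeat=n)))
S = np.stack([X[:, i] * X[:, j] for i, j in edges], axis=1)  # sufficient statistics

def log_Z(theta):
    return float(np.log(np.exp(S @ theta).sum()))

def moments_and_hessian(theta):
    p = np.exp(S @ theta - log_Z(theta))           # model probabilities
    mu = p @ S                                     # gradient of log Z
    H = (S * p[:, None]).T @ S - np.outer(mu, mu)  # Hessian of log Z (covariance)
    return mu, H

theta_true = np.array([0.5, -0.3, 0.8, 0.1])   # "true" parameters
mu_hat, _ = moments_and_hessian(theta_true)    # target (empirical) moments

theta = np.zeros(len(edges))
for _ in range(50):
    mu, H = moments_and_hessian(theta)
    grad = mu - mu_hat            # gradient of log Z(theta) - theta . mu_hat
    if np.abs(grad).max() < 1e-10:
        break
    step = np.linalg.solve(H, grad)   # Newton direction (to be subtracted)
    f = log_Z(theta) - theta @ mu_hat
    t = 1.0                           # backtracking (Armijo) line search
    while log_Z(theta - t * step) - (theta - t * step) @ mu_hat > f - 0.25 * t * grad @ step:
        t *= 0.5
    theta = theta - t * step

assert np.allclose(theta, theta_true, atol=1e-4)
```

At the optimum the model moments match the target moments exactly, which is the defining property of the maximum-likelihood estimate in an exponential family.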
3.2 Greedy Planar Graph Selection
We now consider the problem of selecting the planar graph to best approximate a probability distribution $P$ with pairwise moments $\hat{\mu}_{ij} = \mathbb{E}_P[x_i x_j]$ given for all pairs $i, j \in V$. Formally, we seek the planar graph that maximizes the log-likelihood (minimizes the divergence) relative to $P$:

$\hat{G} = \arg\min_{G \in \mathcal{G}_V} D(P \| P_G)$

where $\mathcal{G}_V$ is the set of planar graphs on the vertex set $V$, $\mathcal{I}(G)$ denotes the family of zero-field Ising models defined on graph $G$ and $P_G = \arg\min_{Q \in \mathcal{I}(G)} D(P \| Q)$ is the maximum-likelihood (minimum-divergence) approximation to $P$ over this family.
We obtain a heuristic solution to this graph-selection problem using the following greedy edge-selection procedure. The input to the algorithm is a probability distribution $P$ (which could be empirical) on $n$ binary random variables. In fact, it is sufficient to summarize $P$ by its pairwise correlations $\hat{\mu}_{ij}$ on all pairs $i, j \in V$. The output is a maximal planar graph $\hat{G}$ and the maximum-likelihood approximation to $P$ in the family of zero-field Ising models defined on this graph. A maximal planar graph is a planar graph to which no new edge can be added without violating planarity.
The algorithm starts with an empty graph and then sequentially adds edges one at a time so as to (heuristically) increase the log-likelihood (decrease the divergence) relative to $P$ as much as possible at each step. Here is a more detailed description of the algorithm, along with estimates of the computational complexity of each step:

Line 3. First, we enumerate the set of all edges one might add (individually) to the graph $G$ while preserving planarity. This is accomplished by an algorithm in which we iterate over all pairs $ij \notin E$, and for each such pair we form the graph $G + ij$ and test planarity of this graph using known algorithms (Chrobak and Payne, 1995).

Line 4. Next, we perform tractable inference calculations with respect to the Ising model on $G$ to calculate the pairwise correlations $\mu_{ij}$ for all pairs $i, j \in V$. This is accomplished using inference calculations on augmented versions of the graph $G$. For each inference calculation we add as many edges to $G$ as possible (setting $\theta = 0$ on these edges) while preserving planarity, and then calculate all the edge-wise moments of this augmented graph using Proposition 2 (including the zero edges). Since each augmented planar graph covers $O(n)$ pairs, this requires at most $O(n)$ iterations to cover all pairs of $V$, so the worst-case complexity to compute all required pairwise moments is $O(n^{5/2})$.

Line 5. Once we have these moments, which specify the corresponding pairwise marginals of the current Ising model, we compare these moments (pairwise marginals) to those of the input distribution $P$ by evaluating the pairwise KL-divergence between the Ising model and $P$. As seen in the following proposition, this gives us a lower bound on the improvement obtained by adding edge $ij$ (see Supplement for proof):
Proposition 4.
Let $P_G$ and $P_{G+ij}$ be the projections of $P$ onto $\mathcal{I}(G)$ and $\mathcal{I}(G+ij)$ respectively. Then,

$D(P \| P_G) - D(P \| P_{G+ij}) \geq D(P_{ij} \| P_{G,ij})$

where $P_{ij}$ and $P_{G,ij}$ represent the marginal distributions on $(x_i, x_j)$ of the probabilities $P$ and $P_G$ respectively.
Thus, we greedily select the next edge to add so as to maximize this lower bound on the improvement, measured by the increase in log-likelihood (this being equal to the decrease in KL-divergence).

Line 6. Finally, we calculate the new maximum-likelihood parameters on the new graph $G + ij$. This involves solving the convex optimization problem discussed in the preceding subsection, which requires $O(n^3)$ complexity. This step is necessary in order to subsequently calculate the pairwise moments which guide further edge-selection steps, and also to provide the final estimate.
We continue adding one edge at a time until a maximal planar graph (with $3n - 6$ edges) is obtained. Thus, the total complexity of our greedy algorithm for planar graph selection is roughly $O(n^4)$.
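The greedy loop above can be sketched as follows, assuming the third-party networkx library for planarity testing. For brevity, the score used here is a simple absolute-correlation surrogate rather than the divergence bound of Proposition 4, and the per-step maximum-likelihood refit is omitted, so this illustrates only the structure-selection skeleton:

```python
import itertools
import networkx as nx
import numpy as np

def greedy_planar_structure(corr):
    """Greedy planar structure selection sketch: repeatedly add the
    planarity-preserving edge with the largest absolute pairwise score.
    `corr` is a symmetric (n x n) score matrix, a stand-in for the
    divergence-based score of Proposition 4."""
    n = corr.shape[0]
    G = nx.empty_graph(n)
    candidates = set(itertools.combinations(range(n), 2))
    while candidates:
        best, best_score = None, -np.inf
        for (i, j) in candidates:
            G.add_edge(i, j)
            planar, _ = nx.check_planarity(G)
            G.remove_edge(i, j)
            if planar and abs(corr[i, j]) > best_score:
                best, best_score = (i, j), abs(corr[i, j])
        if best is None:
            break
        G.add_edge(*best)
        candidates.discard(best)
        # Drop candidate edges that can no longer be added planarly.
        candidates = {e for e in candidates
                      if nx.check_planarity(nx.Graph(list(G.edges) + [e]))[0]}
    return G

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 8))
corr = (A + A.T) / 2
G = greedy_planar_structure(corr)
assert nx.check_planarity(G)[0]
```

The loop stops exactly when no edge can be added while preserving planarity, i.e., when the graph is maximal planar (at most $3n - 6$ edges).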
Non-Maximal Planar Graphs
Since adding an edge always improves the log-likelihood, the greedy algorithm always outputs a maximal planar graph. However, this might lead to overfitting of the data, especially when the input probability distribution is an empirical distribution. Note that at $3n - 6$ edges, the maximal planar graph is still sparse, and our empirical work indicates that overfitting is often not an issue. In cases where overfitting is a concern, we could terminate the algorithm when adding an edge to the graph would improve the log-likelihood by less than some threshold $\epsilon$. An experimental search can be performed for a suitable value of this threshold (e.g., so as to minimize some estimate of the generalization error, as in cross-validation methods (Zhang, 1993)). Alternatively, one could use a heuristic value for $\epsilon$ based on the number of samples, such as Akaike's information criterion (AIC) or Schwarz's Bayesian information criterion (BIC) (Akaike, 1974; Schwarz, 1978).
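A minimal sketch of such a threshold-based stopping rule, using a BIC-style per-sample threshold of $\log(S)/(2S)$ per added edge parameter (the function and the exact form of the penalty are illustrative assumptions, not the paper's prescription):

```python
import numpy as np

def stop_by_bic(loglik_per_sample, num_samples):
    """Given the training log-likelihood per sample after each greedy edge
    addition (index 0 = empty graph), return the number of edges chosen by
    a BIC-style rule: stop once the per-sample gain from the next edge
    drops below log(S) / (2S), the BIC cost of one extra parameter."""
    threshold = np.log(num_samples) / (2 * num_samples)
    k = 0
    for prev, curr in zip(loglik_per_sample, loglik_per_sample[1:]):
        if curr - prev < threshold:
            break
        k += 1
    return k

# Diminishing gains: with S = 100 samples, threshold ~ 0.023 per sample.
ll = [-5.0, -4.5, -4.2, -4.19, -4.189]
k = stop_by_bic(ll, 100)  # stops after the gains 0.5 and 0.3, before 0.01
```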
Outerplanar Graphs and Non-Zero Means
The greedy algorithm returns a zero-field Ising model (in which all the random variables have zero mean) defined on a planar graph. If the actual random variables have non-zero means, this may not be desirable. For this case we may prefer to exactly model the mean of each random variable but still retain tractability by restricting the greedy learning algorithm to select outerplanar graphs. This model faithfully represents the marginal of each random variable, but at the cost of modeling fewer pairwise interactions among the variables.
This is equivalent to the following procedure. First, given the sample moments $\hat{\mu}_i$ and $\hat{\mu}_{ij}$, we convert these to an equivalent set of zero-mean moments on the extended vertex set $V^* = \{0\} \cup V$ according to Proposition 1. Then, we select a zero-mean planar Ising model for these moments using our greedy algorithm. However, to fit the means of each of the original variables, we initialize the graph to include the edges $\{0, i\}$ for all $i \in V$ (requiring that these are present in our final estimate of the graph $\hat{G}^*$). After this initialization step, we use the same greedy edge-selection procedure as before. This yields the graph $\hat{G}^*$ and parameters $\hat{\theta}^*$. Lastly, we convert back to a (non-zero-field) Ising model on the subgraph of $\hat{G}^*$ defined on the nodes $V$, as prescribed by Proposition 1. The resulting graph and parameters are our heuristic solution for the maximum-likelihood outerplanar Ising model.
We remark that it is not essential to choose between the zero-field planar Ising model and the outerplanar Ising model. The greedy algorithm may instead select something in between: a partial outerplanar Ising model in which only the nodes on the outer face are allowed to have non-zero means. This is accomplished simply by omitting the initialization step of adding the edges $\{0, i\}$ for all $i \in V$.
4 Experiments
We present the results of experiments evaluating our algorithm on known models with simulated data to assess the correctness of the learned models. We generate two styles of known Ising models: a grid with zero field, and an outerplanar model whose nodes have non-zero means, shown in Figures 1(a) and 1(d). The edge parameters are chosen uniformly at random, with the condition that their absolute values exceed a threshold so as to avoid edges with negligible interactions. We use Gibbs sampling to obtain samples from each model and calculate empirical moments from these samples, which are then passed as input to our algorithm. We run 10 trials of randomly generated edge parameters and data samples. Though our algorithm can run on graphs with many more nodes, we choose small examples here to illustrate the results effectively. On the outerplanar model, we ensure that the first moments of all the nodes are matched by starting our algorithm with the auxiliary node connected to all other nodes.
As the planar learning algorithm adds edges to the model, the likelihood of the training data is guaranteed to increase. We assess how adding edges affects the likelihood of out-of-sample test data. Figures 1(b) and 1(e) demonstrate that likelihood on test sets generally increases as edges are added, up to the maximal planar graph. The true number of edges in each synthetic graph is marked with a vertical dotted line. On the smallest datasets the out-of-sample performance begins to degrade, a sign of overfitting the training data; yet the likelihood of the maximal graph is not significantly worse than the best likelihood obtained (with fewer edges).
We also compare against a Markov random field (MRF) learning algorithm for binary data (Schmidt et al., 2008), as implemented in the undirected graphical model learning Matlab package, UGMLearn (http://www.cs.ubc.ca/~murphyk/Software/L1CRF). UGM is not restricted to learning planar graphs. The objective is optimized via projected gradient descent. We try two versions of the objective function, one using pseudo-likelihood and the other using loopy belief propagation for inference. UGM employs a regularization parameter which we set using two different methods. First, we used the tuning method on validation data as detailed in Schmidt et al. (2008). That is, we split the data into two parts, train on half the data using 7 different values for the parameter, measure the data likelihood of the other half of the data and vice versa, then select the parameter value that maximizes the validation-data likelihood across both folds. The learned model is trained on the full training data with the tuned regularization-parameter value. The second method for setting the regularization parameter we call the oracle method, where we select the learned model at the true number of edges in our known models. For UGM, we set the regularization parameter via linear search until the true number of edges are learned.
We compare the likelihood of test data from the various learned models in Figures 1(c) and 1(f). For comparison, we selected the maximal planar graph that our algorithm learns, Planar maximal, as well as the planar graph learned if the algorithm is stopped when the true number of edges is reached, Planar oracle. We compare against UGM pseudo tuned and UGM loopy tuned, both of which tune the regularization parameter on validation data; the former uses pseudo-likelihood in learning and the latter uses loopy belief propagation. The tuning method is the most common way of selecting the regularization parameter, but tends to produce relatively dense graphs. For fair comparison, we also show the likelihood of UGM pseudo oracle and UGM loopy oracle; that is, the models with the known true number of edges.
Figures 1(c) and 1(f) show that our greedy planar Ising model learning algorithm is at least as accurate as, and often better than, the UGM learning algorithms on these inputs. As mentioned earlier, we see that Planar maximal and Planar oracle fit the test data nearly equally well. On the outerplanar model, UGM pseudo tuned performs nearly as well as our planar algorithm, yet on the larger grid model it performs quite poorly at the smaller sample sizes. UGM loopy tuned performs more consistently close to our planar algorithm, but loopy belief propagation appears to perform worse at large sample sizes.
On the largest dataset of the grid model, UGM was aborted after running for 40 hours without reaching convergence on a single run, and so results are not available.
5 Application: Modeling Correlations of Senator Voting
[Figure 2(b) caption: Likelihood of hold-out data versus the number of edges in the learned graph. Note the break in the x-axis, due to tuned UGM learning dense graphs. For the tuned UGM models, we indicate the standard error on the number of edges learned.]
We consider an interesting application of our algorithm to modeling correlations of senator voting, following Banerjee et al. (2008). We use senator voting data from the years 2009 and 2010 to calculate correlations in the voting patterns among senators. A Yea vote is treated as $+1$ and a Nay vote is treated as $-1$. Non-votes are also assigned a default value, but we only consider senators who voted in a sufficiently large fraction of the votes to limit the resulting bias. The data includes one variable per senator and 645 samples. To accommodate the non-zero-mean data we add an auxiliary node and allow the algorithm to select the connections between it and the other nodes. We run a 10-fold cross-validation, training on 90% of the data and measuring likelihood on the held-out 10% of the data. Figure 2(b) shows that the likelihood of test data increases as edges are added. We also show the likelihood of cross-validation test data for the UGM pseudo and UGM loopy algorithms for two different methods of choosing the value of the regularization parameter: (1) the value that produces the same number of edges as the maximal planar graph (318 edges); and (2) the value selected by tuning with validation data (a variable number of edges, typically a dense graph). The likelihoods of the sparse UGM models are significantly worse than that of the planar model. Only the UGM loopy algorithm at a very dense (nearly fully-connected) graph fits the test data better.
The maximal planar graph learned from the full dataset, shown in Figure 2, conveys many facts that are already known. For instance, the graph shows Sanders with edges only to Democrats, which makes sense because he caucuses with Democrats; the same is true of Lieberman. The graph also shows the senate minority leader McConnell well connected to other Republicans, though the same is not true of the senate majority leader Reid. The learned UGM models can be seen in the Supplement; they show that the non-planar models are qualitatively different, learning one or two densely connected components.
6 Conclusion and Future Work
We provide a greedy heuristic to obtain the maximum-likelihood planar Ising model approximation to a collection of binary random variables with known pairwise marginals. The algorithm is simple to implement with the help of known methods for tractable exact inference in planar Ising models and efficient methods for planarity testing and embedding of planar graphs. Empirical results of our algorithm on sample data and on the senate voting record show that it is competitive with arbitrary (non-planar) graph learning.
Many directions for further work are suggested by the methods and results of this paper. Firstly, we know that the greedy algorithm is not guaranteed to find the best planar graph. In the Supplement, we provide an enlightening counterexample in which the combination of the planarity restriction and the greedy method prevents the correct model from being learned. That counterexample suggests strategies one might consider to further refine the estimate. One strategy would be to allow the greedy algorithm to prune edges which turn out to be less important once later edges are added. It would also be feasible to implement a multi-step greedy look-ahead search for selecting which edge to add (or prune) next.
Another limitation is that our current framework only allows learning planar graphical models on the set of observed random variables and requires that all variables are observed in each sample. One could imagine extensions of our approach that handle missing samples or identify hidden variables that were not seen in the data. This offers another avenue to achieve a better fit for data that is not well-approximated by a planar graph among just the observed nodes, but might be well-approximated as the marginal distribution of a planar model with more nodes.
Supplementary Appendix
Appendix A Proofs
Proof of Proposition 1.
Let the probability distributions corresponding to $G$ and $G^*$ be $P$ and $P^*$ respectively, and the corresponding expectations be $\mathbb{E}$ and $\mathbb{E}^*$ respectively. For the partition function, we have that

$Z^* = \sum_{x_0, x} \exp\Big( \sum_{i \in V} \theta_i x_0 x_i + \sum_{ij \in E} \theta_{ij} x_i x_j \Big) = \sum_{x} \exp\Big( \sum_{i} \theta_i x_i + \sum_{ij} \theta_{ij} x_i x_j \Big) + \sum_{x} \exp\Big( -\sum_{i} \theta_i x_i + \sum_{ij} \theta_{ij} x_i x_j \Big) = 2Z$

where the last equality follows from the symmetry between $x$ and $-x$ in an Ising model.
For the second part, since $P^*$ is zero-field, we have that $\mathbb{E}^*[x_i] = 0$ for all $i \in V^*$.
Now consider any $i \in V$. If $x_0$ is fixed to the value $+1$, then the model is the same as the original model on $G$ and we have $\mathbb{E}^*[x_i \mid x_0 = +1] = \mathbb{E}[x_i]$.
By the symmetry (between $x$ and $-x$) in the model, the conditional expectation given $x_0 = -1$ is the negative of this, and so we have $\mu^*_{0i} = \mathbb{E}^*[x_0 x_i] = \mathbb{E}[x_i] = \mu_i$.
Fixing $x_0$ to the value $+1$, we have $\mathbb{E}^*[x_i x_j \mid x_0 = +1] = \mathbb{E}[x_i x_j]$,
and by symmetry the same holds conditioned on $x_0 = -1$.
Combining the two equations above, we have $\mu^*_{ij} = \mathbb{E}^*[x_i x_j] = \mathbb{E}[x_i x_j] = \mu_{ij}$.
∎
Proof of Proposition 2.
From Theorem 1, we see that the log-partition function can be written as

$\Phi(\theta) = n \log 2 + \sum_{ij \in E} \log \cosh \theta_{ij} + \tfrac{1}{2} \log \det(I - \Lambda W)$

where $\Lambda$ and $W$ are as given in Theorem 1. Differentiating this expression with respect to $\theta_{ij}$ introduces the derivative of the matrix $\Lambda W$ with respect to $\theta_{ij}$; the result follows from the chain rule and the fact that

$\frac{\partial}{\partial t} \log \det A(t) = \operatorname{tr}\Big( A(t)^{-1} \frac{\partial A(t)}{\partial t} \Big)$

for any invertible matrix $A(t)$. Please refer to Boyd and Vandenberghe (2004) for details. ∎
Proof of Proposition 4.
The proof follows from the following chain of inequalities.
where the first step follows from the Pythagorean law of information projection (Amari et al., 1992), the second step follows from the conditional rule of relative entropy (Cover and Thomas, 2006), the third step follows from the information inequality (Cover and Thomas, 2006) and finally the fourth step follows from the property of information projection to (Wainwright and Jordan, 2008). ∎
Appendix B Experiments: Counter Example
The result presented in Figure 3 illustrates the fact that our algorithm does not always recover the exact structure, even when the underlying graph is planar and the algorithm is given exact moments as input. This counterexample gives insight into how the greedy algorithm works. The basic idea is that graphical models can have nodes which are not neighbors but are more correlated than some other nodes which are neighbors. If the spurious edges corresponding to these highly correlated pairs are added early on in the algorithm, then the actual edges may have to be left out because of the planarity restriction.
We define a zero-field Ising model on the graph in Figure 3(a), with some edge parameters set much larger than those of all the other edges. Figure 3(a) shows the edge parameters pictorially using the intensity of the edges: the higher the intensity of an edge, the higher the corresponding edge parameter. With these edge parameters, the correlation between a particular pair of non-adjacent nodes is greater than the correlation between any other pair of nodes. This leads to the edge between them being the first edge added by the algorithm. However, since $K_5$ (the complete graph on five nodes) is not planar, one of the actual edges is missed in the output graph, as shown in Figure 3(b).
Appendix C Example Application: UGM Learned Models
For comparison to our planar learning algorithm, we provide the results of using the UGM MRF learning algorithm on the senate voting data. For all figures, we use a force-directed graph drawing algorithm (Fruchterman and Reingold, 1991). Figure 4 presents the graph learned using pseudo-likelihood, UGM pseudo, from the full dataset with the regularization parameter set to obtain the same number of edges as learned in the planar case (318 edges). Figure 5 presents the graph learned using pseudo-likelihood, UGM pseudo tuned, from the full dataset after selecting the regularization parameter by cross-validation tuning. Figure 6 presents the graph learned using loopy belief propagation, UGM loopy, from the full dataset with the regularization parameter set to obtain 318 edges. The graph learned using UGM loopy tuned is not displayed because it is a nearly fully-connected graph providing no visual information.
References
 Abbeel et al. (2006) P. Abbeel, D. Koller, and A. Ng. Learning factor graphs in polynomial time and sample complexity. J. Mach. Learn. Res., 7, 2006.
 Akaike (1974) H. Akaike. A new look at the statistical model identification. IEEE Trans. Automatic Control, 19(6), 1974.
 Amari et al. (1992) S. Amari, K. Kurata, and H. Nagaoka. Information geometry of Boltzmann machines. IEEE Trans. Neural Networks, 3(2), 1992.
 Bach and Jordan (2001) F. Bach and M. Jordan. Thin junction trees. In NIPS, 2001.
 Banerjee et al. (2008) O. Banerjee, L. El Ghaoui, and A. d’Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. J. Mach. Learn. Res., 9, 2008.
 Barndorff-Nielsen (1979) O. Barndorff-Nielsen. Information and exponential families in statistical theory. Bull. Amer. Math. Soc., 1979.
 Boyd and Vandenberghe (2004) S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge U. Press, 2004.
 Chow and Liu (1968) C. Chow and C. Liu. Approximating discrete probability distributions with dependence trees. IEEE Trans. on Information Theory, 14, 1968.
 Chrobak and Payne (1995) M. Chrobak and T. Payne. A lineartime algorithm for drawing a planar graph on a grid. Infor. Processing Letters, 54(4), 1995.
 Cover and Thomas (2006) T. Cover and J. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). WileyInterscience, 2006.
 Deshpande et al. (2001) A. Deshpande, M. Garofalakis, and M. Jordan. Efficient stepwise selection in decomposable models. In UAI, 2001.
 Fisher (1966) M. Fisher. On the dimer solution of planar Ising models. J. Math. Phys., 7(10), 1966.
 Fruchterman and Reingold (1991) T. Fruchterman and E. Reingold. Graph drawing by forcedirected placement. Software: Practice and Experience, 21(11):1129–1164, 1991.
 Galluccio et al. (2000) A. Galluccio, M. Loebl, and J. Vondrak. New algorithm for the Ising problem: Partition function for finite lattice graphs. Physical Review Letters, 84(26), 2000.
 Globerson and Jaakkola (2007) A. Globerson and T. Jaakkola. Approximate inference using planar graph decomposition. In NIPS, 2007.
 Kac and Ward (1952) M. Kac and J. Ward. A combinatorial solution of the twodimensional Ising model. Phys. Rev., 88(6), 1952.
 Karger and Srebro (2001) D. Karger and N. Srebro. Learning Markov networks: Maximum bounded treewidth graphs. In SODA, 2001.
 Kasteleyn (1963) P. Kasteleyn. Dimer statistics and phase transitions. J. Math. Phys., 4(2), 1963.
 Lauritzen (1996) S. Lauritzen. Graphical Models. Oxford U. Press, 1996.
 Lee et al. (2006) S. Lee, V. Ganapathi, and D. Koller. Efficient structure learning of Markov networks using L1-regularization. In NIPS, 2006.
 Lipton and Tarjan (1979) R. Lipton and R. Tarjan. A separator theorem for planar graphs. SIAM J. Applied Math., 36(2), 1979.
 Lipton et al. (1979) R. Lipton, D. Rose, and R. Tarjan. Generalized nested dissection. SIAM J. Numer. Analysis, 16(2), 1979.
 Loebl (2010) M. Loebl. Discrete Mathematics in Statistical Physics. Vieweg + Teubner, 2010.
 MacKay (2003) D. MacKay. Information Theory, Inference and Learning Algorithms. Cambridge U. Press, 2003.
 Onsager (1944) L. Onsager. Crystal statistics. I. A two-dimensional model with an order-disorder transition. Phys. Rev., 65(3-4), 1944.
 Ravikumar et al. (2010) P. Ravikumar, M. Wainwright, and J. Lafferty. High-dimensional Ising model selection using L1-regularized logistic regression. Annals of Statistics, 38(3), 2010.
 Schmidt et al. (2008) M. Schmidt, K. Murphy, G. Fung, and R. Rosales. Structure learning in random fields for heart motion abnormality detection. In Computer Vision and Pattern Recognition (CVPR), 2008 IEEE Conference on, 2008.
 Schraudolph and Kamenetsky (2008) N. Schraudolph and D. Kamenetsky. Efficient exact inference in planar Ising models. In NIPS, 2008.
 Schwarz (1978) G. Schwarz. Estimating the dimension of a model. Annals of Stat., 6(2), 1978.
 Shahaf et al. (2009) D. Shahaf, A. Checketka, and C. Guestrin. Learning thin junction trees via graph cuts. In AISTATS, 2009.
 Sherman (1960) S. Sherman. Combinatorial aspects of the Ising model for ferromagnetism. I. A conjecture of Feynman on paths and graphs. J. Math. Phys., 1(3), 1960.
 Wainwright and Jordan (2008) M. Wainwright and M. Jordan. Graphical Models, Exponential Families, and Variational Inference. Now Publishers Inc., 2008.
 Zhang (1993) P. Zhang. Model selection via multifold cross validation. Annals of Stat., 21(1), 1993.