# Learning Loosely Connected Markov Random Fields

We consider the structure learning problem for graphical models that we call loosely connected Markov random fields, in which the number of short paths between any pair of nodes is small, and present a new conditional independence test based algorithm for learning the underlying graph structure. The novel maximization step in our algorithm ensures that the true edges are detected correctly even when there are short cycles in the graph. The number of samples required by our algorithm is $C \log p$, where $p$ is the size of the graph and the constant $C$ depends on the parameters of the model. We show that several previously studied models are examples of loosely connected Markov random fields, and our algorithm achieves the same or lower computational complexity than the previously designed algorithms for individual cases. We also obtain new results for more general graphical models. In particular, our algorithm learns general Ising models on the Erdős-Rényi random graph $G(p, c/p)$ correctly with running time $O(np^5)$.

## 1 Introduction

In many models of networks, such as social networks and gene regulatory networks, each node in the network represents a random variable and the graph encodes the conditional independence relations among the random variables. A Markov random field is a particular such representation which has applications in a variety of areas (see [3] and the references therein). In a Markov random field, the lack of an edge between two nodes implies that the two random variables are independent, conditioned on all the other random variables in the network.

Structure learning, i.e., learning the underlying graph structure of a Markov random field, refers to the problem of determining if there is an edge between each pair of nodes, given i.i.d. samples from the joint distribution of the random vector. As a concrete example of structure learning, consider a social network in which only the participants’ actions are observed. In particular, we do not observe, or are unable to observe, interactions between the participants. Our goal is to infer relationships among the nodes (participants) in such a network by understanding the correlations among the nodes. The canonical example used to illustrate such inference problems is the US Senate [4]. Suppose one has access to the voting patterns of the senators over a number of bills (and not their party affiliations or any other information). The question we would like to answer is the following: can we say that a particular senator’s vote is independent of everyone else’s when conditioned on a few other senators’ votes? In other words, if we view the senators’ actions as forming a Markov Random Field (MRF), we want to infer the topology of the underlying graph.

In general, learning high-dimensional densely connected graphical models requires a large number of samples and is usually computationally intractable. In this paper, we focus on a more tractable family which we call loosely connected MRFs. Roughly speaking, a Markov random field is loosely connected if the number of short paths between any pair of nodes is small. We show that many previously studied models are examples of this family. In fact, as densely connected graphical models are difficult to learn, some sparsity assumptions are necessary to make the learning problem tractable. Common assumptions include an upper bound on the node degree of the underlying graph [7, 15], restrictions on the class of parameters of the joint probability distribution of the random variables to ensure correlation decay [7, 15, 2], lower bounds on the girth of the underlying graph [15], and a sparse, probabilistic structure on the underlying random graph [2]. In all these cases, the resulting MRFs turn out to be loosely connected. In this sense, our definition here provides a unified view of the assumptions in previous works.

However, loosely connected MRFs are not always easy to learn. Due to the existence of short cycles, the dependence over an edge connecting a pair of neighboring nodes can be approximately cancelled by some short non-direct paths between them, in which case correctly detecting this edge is difficult, as shown in the following example. This example is perhaps well-known, but we present it here to motivate our algorithm presented later.

###### Example 1.1.

Consider three binary random variables . Assume are independent random variables and with probability , where means exclusive or. We note that this joint distribution is symmetric, i.e., we get the same distribution if we assume that are independent and with probability . Therefore, the underlying graph is a triangle. However, it is not hard to see that the three random variables are marginally independent. For this simple example, previous methods in [15, 3] fail to learn the true graph.∎
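The phenomenon in this example can be checked numerically. The sketch below is illustrative only: it assumes the first two variables are independent fair bits and that the third equals their exclusive or with probability `Q = 0.9` (a value chosen here for illustration, not taken from the text). The helper names are hypothetical, and nodes are indexed 0, 1, 2.

```python
import itertools
import math

Q = 0.9  # assumed probability that the third bit equals the XOR (illustrative)

def joint():
    """Exact pmf over (x1, x2, x3): x1, x2 fair and independent, x3 = x1 ^ x2 w.p. Q."""
    pmf = {}
    for x1, x2, x3 in itertools.product((0, 1), repeat=3):
        pmf[(x1, x2, x3)] = 0.25 * (Q if x3 == (x1 ^ x2) else 1.0 - Q)
    return pmf

def marginal(pmf, idx):
    """Marginal pmf of the coordinates listed in idx (a tuple of indices)."""
    out = {}
    for x, prob in pmf.items():
        key = tuple(x[k] for k in idx)
        out[key] = out.get(key, 0.0) + prob
    return out

def cond_mutual_info(pmf, i, j, cond):
    """I(X_i; X_j | X_cond) in nats, computed from the exact pmf."""
    p_ijc = marginal(pmf, (i, j) + cond)
    p_ic = marginal(pmf, (i,) + cond)
    p_jc = marginal(pmf, (j,) + cond)
    p_c = marginal(pmf, cond)
    total = 0.0
    for key, prob in p_ijc.items():
        if prob > 0:
            xi, xj, xc = key[0], key[1], key[2:]
            ratio = prob * p_c[xc] / (p_ic[(xi,) + xc] * p_jc[(xj,) + xc])
            total += prob * math.log(ratio)
    return total

pmf = joint()
print("I(X1; X3)      =", cond_mutual_info(pmf, 0, 2, ()))
print("I(X1; X3 | X2) =", cond_mutual_info(pmf, 0, 2, (1,)))
```

Under these assumed values the unconditional mutual information vanishes while the conditional one is bounded away from zero, which is the behaviour the example describes.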

We propose a new algorithm that correctly learns the graphs for loosely connected MRFs. For each node, the algorithm loops over all the other nodes to determine if they are neighbors of this node. The key step in the algorithm is a max-min conditional independence test, in which the maximization step is designed to detect the edges while the minimization step is designed to detect non-edges. The minimization step is used in several previous works such as [2, 3]. The maximization step has been added to explicitly break the short cycles that can cause problems in edge detection. If the direct edge is the only edge between a pair of neighboring nodes, the dependence over the edge can be detected by a simple independence test. When there are other short paths between a pair of neighboring nodes, we first find a set of nodes that separates all the non-direct paths between them, i.e., after removing this set of nodes from the graph, the direct edge is the only short path connecting the two nodes. Then the dependence over the edge can again be detected by a conditional independence test where the conditioned set is the set above. In Example 1.1, and are unconditionally independent as the dependence over edge is canceled by the other path . If we break the cycle by conditioning on , and become dependent, so our algorithm is able to detect the edges correctly. As the size of the conditioned sets is small for loosely connected MRFs, our algorithm has low complexity. In particular, for models with at most short paths between non-neighbor nodes and non-direct paths between neighboring nodes, the running time for our algorithm is .

If the MRF satisfies a pairwise non-degeneracy condition, i.e., the correlation between any pair of neighboring nodes is lower bounded by some constant, then we can extend the basic algorithm to incorporate a correlation test as a preprocessing step. For each node, the correlation test adds those nodes whose correlation with the current node is above a threshold to a candidate neighbor set, which is then used as the search space for the more computationally expensive max-min conditional independence test. If the MRF has fast correlation decay, the size of the candidate neighbor set can be greatly reduced, so we can achieve much lower computational complexity with this extended algorithm.

When applying our algorithm to Ising models, we get lower computational complexity for a ferromagnetic Ising model than a general one on the same graph. Intuitively, the edge coefficient means that and are positively dependent. For any path between , as all the edge coefficients are positive, the dependence over the path is also positive. Therefore, the non-direct paths between a pair of neighboring nodes make and , which are positively dependent over the edge , even more positively dependent. Therefore, we do not need the maximization step which breaks the short cycles and the resulting algorithm has running time . In addition, the pairwise non-degeneracy condition is automatically satisfied and the extended algorithm can be applied.

### 1.1 Relation to Prior Work

We focus on computational complexity rather than sample complexity in comparing our algorithm with previous algorithms. In fact, it has been shown that samples are required to learn the graph correctly with high probability, where is the size of the graph [19]. For all the previously known algorithms for which analytical complexity bounds are available, the number of samples required to recover the graph correctly with high probability, i.e., the sample complexity, is . Not surprisingly, the sample complexity for our algorithm is also under reasonable assumptions.

Our algorithm with the probability test reproduces the algorithm in [7, Theorem 3] for MRFs on bounded degree graphs. Our algorithm is more flexible and achieves lower computational complexity for MRFs that are loosely connected but have a large maximum degree. In particular, reference [15] proposed a low complexity greedy algorithm that is correct when the MRF has correlation decay and the graph has large girth. We show that under the same assumptions, we can first perform a simple correlation test and reduce the search space for neighbors from all the nodes to a constant size candidate neighbor set. With this preprocessing step, our algorithm and the algorithms in [7, 15, 18] have computational complexity , which is lower than what we would get by only applying the greedy algorithm [15]. The results in [18] improve over [15] by proposing two new greedy algorithms that are correct for learning small girth graphs. However, the algorithm in [18] requires a constant size candidate neighbor set as input, which might not be easy to obtain in general. In fact, for MRFs with bad short cycles as in Example 1.1, learning a candidate neighbor set can be as difficult as directly learning the neighbor set.

Our analysis of the class of Ising models on sparse Erdős-Rényi random graphs was motivated by the results in [2] which studies the special case of the so-called ferromagnetic Ising models defined over an Erdős-Rényi random graph. The computational complexity of the algorithm in [2] is . In this case, the key step of our algorithm reduces to the algorithm in [2]. But we show that, under the ferromagnetic assumption, we can again perform a correlation test to reduce the search space for neighbors, and the total computational complexity for our algorithm is .

The results in [3] extend the results in [2] to general Ising models and more general sparse graphs (beyond the Erdős-Rényi model). We note that the tractable graph families in [3] are similar to our notion of loosely connected MRFs. For general Ising models over sparse Erdős-Rényi random graphs, our algorithm has computational complexity while the algorithm in [3] has computational complexity . The difference comes from the fact that our algorithm has an additional maximization step to break bad short cycles as in Example 1.1. Without this maximization step, the algorithm in [3] fails for this example. The performance analysis in [3] explicitly excludes such difficult cases by noting that these “unfaithful” parameter values have Lebesgue measure zero [3, Section B.3.2]. However, when the Ising model parameters lie close to this Lebesgue measure zero set, the learning problem is still ill-posed for the algorithm in [3], i.e., the sample complexity required to recover the graph correctly with high probability depends on how close the parameters are to this set, which is not the case for our algorithm. In fact, the same problem with the argument that the unfaithful set is of Lebesgue measure zero has been observed for causal inference in the Gaussian case [20]. It has been shown in [20] that a stronger notion of faithfulness is required to get uniform sample complexity results, and the set that is not strongly faithful has non-zero Lebesgue measure and can be surprisingly large.

Another way to learn the structures of MRFs is by solving $\ell_1$-regularized convex optimizations under a set of incoherence conditions [17]. It is shown in [13] that, for some Ising models on a bounded degree graph, the incoherence conditions hold when the Ising model is in the correlation decay regime. But the incoherence conditions do not have a clear interpretation as conditions for the graph parameters in general and are NP-hard to verify for a given Ising model [13]. Using results from standard convex optimization theory [6], it is possible to design a polynomial complexity algorithm to approximately solve the $\ell_1$-regularized optimization problem. However, the actual complexity will depend on the details of the particular algorithm used, so it is not clear how to compare the computational complexity of our algorithm with the one in [17].

We note that the recent development of directed information graphs [16] is closely related to the theory of MRFs. Learning a directed information graph, i.e., finding the causal parents of each random process, is essentially the same as finding the neighbors of each random variable in learning an MRF. Therefore, our algorithm for learning MRFs can potentially be used to learn directed information graphs as well.

The paper is organized as follows. We present some preliminaries in the next section. In Section 3, we define loosely-connected MRFs and show that several previously studied models are examples of this family. In Section 4, we present our algorithm and show the conditions required to correctly recover the graph. We also provide the concentration results in this section. In Section 5, we apply our algorithm to the general Ising models studied in Section 3 and evaluate its sample complexity and computational complexity in each case. In Section 6, we show that our algorithm achieves even lower computational complexity when the Ising model is ferromagnetic. Experimental results are presented in Section 7.

## 2 Preliminaries

### 2.1 Markov Random Fields (MRFs)

Let be a random vector with distribution and be an undirected graph consisting of nodes with each node associated with the element of Before we define an MRF, we introduce the notation to denote any subset of the random variables in A random vector and graph pair is called an MRF if it satisfies one of the following three Markov properties:

1. Pairwise Markov: where denotes independence.

2. Local Markov: where is the set of neighbors of node

3. Global Markov: , if separates on . In this case, we say is an I-map of . Further if is an I-map of and the global Markov property does not hold if any edge of is removed, then is called a minimal I-map of X.

In all three cases, encodes a subset of the conditional independence relations of and we say that is Markov with respect to . We note that the global Markov property implies the local Markov property, which in turn implies the pairwise Markov property.

When , the three Markov properties are equivalent, i.e., if there exists a under which one of the Markov properties is satisfied, then the other two are also satisfied. Further, in the case when there exists a unique minimal I-map of . The unique minimal I-map is constructed as follows:

1. Each random variable is associated with a node

2. if and only if .

In this paper, we consider the case and are interested in learning the structure of the associated unique minimal I-map. We will also assume that, for each takes on values in a discrete, finite set . We will also be interested in the special case where the MRF is an Ising model, which we describe next.

### 2.2 Ising Model

Ising models are a type of well-studied pairwise Markov random fields. In an Ising model, each random variable takes values in the set and the joint distribution is parameterized by constants called edge coefficients and external fields

$$P(x) = \frac{1}{Z} \exp\left( \sum_{(i,j) \in E} J_{ij} x_i x_j + \sum_{i \in V} h_i x_i \right),$$

where is a normalization constant to make a probability distribution. If , we say the Ising model is zero-field. If , we say the Ising model is ferromagnetic.

Ising models have the following useful property. Given an Ising model, the conditional probability corresponds to an Ising model on with edge coefficients unchanged and modified external fields , where is the additional external field on node induced by fixing .
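For very small graphs, the joint distribution in this subsection can be evaluated by brute-force enumeration. The following is a minimal sketch, assuming spins take values in {-1, +1}; the function name `ising_pmf` and its argument layout are hypothetical and chosen only for illustration.

```python
import itertools
import math

def ising_pmf(num_nodes, J, h):
    """Exact pmf of a small Ising model by enumerating all configurations.

    J: dict mapping an edge (i, j) to its edge coefficient J_ij.
    h: list of external fields, one per node.
    Spins are assumed to take values in {-1, +1}.
    """
    weights = {}
    for x in itertools.product((-1, 1), repeat=num_nodes):
        exponent = sum(Jij * x[i] * x[j] for (i, j), Jij in J.items())
        exponent += sum(h[i] * x[i] for i in range(num_nodes))
        weights[x] = math.exp(exponent)
    Z = sum(weights.values())  # normalization constant
    return {x: w / Z for x, w in weights.items()}

# Zero-field ferromagnetic model on a single edge: aligned spins are more likely.
pmf = ising_pmf(2, {(0, 1): 0.5}, [0.0, 0.0])
print(pmf)
```

The enumeration costs time exponential in the number of nodes, so it is only useful as a sanity check on toy models.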

### 2.3 Random Graphs

A random graph is a graph generated from a prior distribution over the set of all possible graphs with a given number of nodes. Let be a function on graphs with nodes and let be a constant. We say almost always for a family of random graphs indexed by if as . Similarly, we say almost always for a family of random graphs if as This is a slight variation of the definition of almost always in [1].

The Erdős-Rényi random graph is a graph on nodes in which the probability of an edge being in the graph is and the edges are generated independently. We note that, in this random graph, the average degree of a node is . In this paper, when we consider random graphs, we only consider the Erdős-Rényi random graph
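As an illustration of this construction, the sketch below samples a graph in which each possible edge is included independently with probability c/p, matching the $G(p, c/p)$ model named in the abstract. The function name and the `seed` parameter are assumptions made for this example.

```python
import random

def erdos_renyi(p, c, seed=0):
    """Sample an undirected graph on p nodes; each edge appears independently w.p. c/p."""
    rng = random.Random(seed)
    edge_prob = c / p
    edges = set()
    for i in range(p):
        for j in range(i + 1, p):
            if rng.random() < edge_prob:
                edges.add((i, j))  # store each edge once with i < j
    return edges

edges = erdos_renyi(400, 3.0, seed=1)
print("average degree:", 2 * len(edges) / 400)
```

For large p the empirical average degree should be close to c, while individual node degrees can be considerably larger.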

### 2.4 High-Dimensional Structure Learning

In this paper, we are interested in inferring the structure of the graph associated with an MRF We will assume that and will refer to the corresponding unique minimal I-map. The goal of structure learning is to design an algorithm that, given i.i.d. samples from the distribution outputs an estimate which equals with high probability when is large. We say that two graphs are equal when their node and edge sets are identical.

In the classical setting, the accuracy of estimating is considered only when the sample size goes to infinity while the random vector dimension is held fixed. This setting is restrictive for many contemporary applications, where the problem size is much larger than the number of samples. A more suitable assumption allows both and to become large, with growing at a slower rate than In such a case, the structure learning problem is said to be high-dimensional.

An algorithm for structure learning is evaluated both by its computational complexity and sample complexity. The computational complexity refers to the number of computations required to execute the algorithm, as a function of and When is a deterministic graph, we say the algorithm has sample complexity if, for there exist constants and independent of such that for all which are Markov with respect to When is a random graph drawn from some prior distribution, we say the algorithm has sample complexity if the above is true almost always. In the high-dimensional setting is much smaller than In fact, we will show that, for the algorithms described in this paper,

## 3 Loosely Connected MRFs

Loosely connected Markov random fields are undirected graphical models in which the number of short paths between any pair of nodes is small. Roughly speaking, a path between two nodes is short if the dependence between the two nodes is non-negligible even if all other paths between the nodes are removed. Later, we will more precisely quantify the term “short” in terms of the correlation decay property of the MRF. For simplicity, we say that a set separates some paths between nodes and if removing disconnects these paths. In such a graphical model, if are not neighbors, there is a small set of nodes separating all the short paths between them, and conditioned on this set of variables the two variables and are approximately independent. On the other hand, if are neighbors, there is a small set of nodes separating all the short non-direct paths between them, i.e., the direct edge is the only short path connecting the two nodes after removing from the graph. Conditioned on this set of variables , the dependence of and is dominated by the dependence over the direct edge and hence is bounded away from zero. The following necessary and sufficient condition for the non-existence of an edge in a graphical model shows that both the sets and above are essential for learning the graph, which we have not seen in prior work.

###### Lemma 3.1.

Consider two nodes and in Then, if and only if .

###### Proof.

Recall from the definition of the minimal I-map that if and only if . Therefore, the statement of the lemma is equivalent to

$$I(X_i; X_j \mid X_{V \setminus \{i,j\}}) = 0 \;\Leftrightarrow\; \min_S \max_T I(X_i; X_j \mid X_S, X_T) = 0,$$

where denotes the mutual information between and conditioned on and we have used the fact that is equivalent to Notice that

$$\min_S \max_T I(X_i; X_j \mid X_S, X_T) = \min_S \max_{T' \supset S} I(X_i; X_j \mid X_{T'})$$

and is an increasing function in . The minimization over is achieved at i.e.,

$$I(X_i; X_j \mid X_{V \setminus \{i,j\}}) = \min_S \max_T I(X_i; X_j \mid X_S, X_T).$$

This lemma tells us that, if there is no edge between node and , we can find a set of nodes such that the removal of S from the graph separates and . From the global Markov property, this implies that . However, as Example 1.1 shows, the converse is not true. In fact, for being the empty set or , we have , but is indeed an edge in the graph. The above lemma completes the statement in the converse direction, showing that we should also introduce a set in addition to the set to correctly identify the edge.

Motivated by this lemma, we define loosely connected MRFs as follows.

###### Definition 3.2.

We say an MRF is -loosely connected if

1. for any , with , with ,

$$\Delta(X_i; X_j \mid X_S, X_T) \le \frac{\epsilon}{4},$$

2. for any , with , with ,

$$\Delta(X_i; X_j \mid X_S, X_T) \ge \epsilon,$$

for some conditional independence test .

The conditional independence test should satisfy if and only if . In this paper, we use two types of conditional independence tests:

• Mutual Information Test:

$$\Delta(X_i; X_j \mid X_S, X_T) = I(X_i; X_j \mid X_S, X_T).$$

• Probability Test:

$$\Delta(X_i; X_j \mid X_S, X_T) = \max_{x_i, x_j, x'_j, x_S, x_T} \left| P(x_i \mid x_j, x_S, x_T) - P(x_i \mid x'_j, x_S, x_T) \right|.$$

Later on, we will see that the probability test gives lower sample complexity for learning Ising models on bounded degree graphs, while the mutual information test gives lower sample complexity for learning Ising models on graphs with unbounded degree.
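The probability test above can be evaluated directly when the joint pmf is known. The sketch below is a hypothetical helper, not part of the original paper: it treats the conditioning variables as a single tuple `cond` (standing for the union of the two conditioning sets) and expects the pmf as a dictionary from configurations to probabilities.

```python
def probability_test(pmf, i, j, cond):
    """max over x_i, x_j, x_j', x_cond of |P(x_i | x_j, x_cond) - P(x_i | x_j', x_cond)|.

    pmf: dict mapping a full configuration (tuple) to its probability.
    cond: tuple of node indices forming the conditioning set.
    """
    joint = {}  # (x_i, x_j, x_cond) -> probability
    denom = {}  # (x_j, x_cond) -> probability
    for x, prob in pmf.items():
        xc = tuple(x[k] for k in cond)
        joint[(x[i], x[j], xc)] = joint.get((x[i], x[j], xc), 0.0) + prob
        denom[(x[j], xc)] = denom.get((x[j], xc), 0.0) + prob
    values_i = {key[0] for key in joint}
    best = 0.0
    for (xj, xc), d in denom.items():
        for (xj2, xc2), d2 in denom.items():
            if xc2 != xc or d == 0.0 or d2 == 0.0:
                continue  # compare only conditionals sharing the same x_cond
            for xi in values_i:
                a = joint.get((xi, xj, xc), 0.0) / d
                b = joint.get((xi, xj2, xc), 0.0) / d2
                best = max(best, abs(a - b))
    return best

# Toy pmf (assumed values): third bit equals the XOR of two fair bits w.p. 0.9.
pmf = {(a, b, c): 0.25 * (0.9 if c == (a ^ b) else 0.1)
       for a in (0, 1) for b in (0, 1) for c in (0, 1)}
print(probability_test(pmf, 0, 2, ()), probability_test(pmf, 0, 2, (1,)))
```

In practice the exact pmf is unavailable and the same computation would be applied to an empirical distribution instead.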

Note that the above definition restricts the size of the sets and to make the learning problem tractable. We show in the rest of the section that several important Ising models are examples of loosely connected MRFs. Unless otherwise stated, we assume that the edge coefficients are bounded, i.e., .

### 3.1 Bounded Degree Graph

We assume the graph has maximum degree . For any , the set of size at most separates and , and for any set we have . For any , the set of size at most separates all the non-direct paths between and . Moreover, we have the following lower bound for neighbors from [7, Proposition 2].

###### Proposition 3.3.

When are neighbors and , there is a choice of such that

$$\left| P(x_i \mid x_j, x_S, x_T) - P(x_i \mid x'_j, x_S, x_T) \right| \ge \frac{\tanh(2 J_{\min})}{2 e^{2 J_{\max}} + 2 e^{-2 J_{\max}}} \triangleq \epsilon.$$

Therefore, the Ising model on a bounded degree graph with maximum degree is a -loosely connected MRF. We note that here we do not use any correlation decay property, and we view all the paths as short.

### 3.2 Bounded Degree Graph, Correlation Decay and Large Girth

In this subsection, we still assume the graph has maximum degree . From the previous subsection, we already know that the Ising model is loosely connected. But we show that when the Ising model is in the correlation decay regime and further has large girth, it is a much sparser model than the general bounded degree case.

Correlation decay is a property of MRFs which says that, for any pair of nodes , the correlation of and decays with the distance between . When an MRF has correlation decay, the correlation of and is mainly determined by the short paths between nodes , and the contribution from the long paths is negligible. It is known that when is small compared with the Ising model has correlation decay. More specifically, we have the following lemma, which is a consequence of the strong correlation decay property [22, Theorem 1].

###### Lemma 3.4.

Assume . , then for any set and ,

$$\left| P(x_i \mid x_j, x_S) - P(x_i \mid x'_j, x_S) \right| \le 4 J_{\max} d \left[ (d-1) \tanh J_{\max} \right]^{l-1} \triangleq \beta \alpha^{l},$$

where and .

###### Proof.

For some given , w.l.o.g. assume . Applying [22, Theorem 1] with , we get

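$$\begin{aligned}
\left| P(x_i \mid x_j, x_S) - P(x_i \mid x'_j, x_S) \right| &\le 1 - \frac{P(x_i \mid x'_j, x_S)}{P(x_i \mid x_j, x_S)} \\
&\le 1 - e^{-4 J_{\max} d \left[ (d-1) \tanh J_{\max} \right]^{d(i,j)-1}} \\
&\le 4 J_{\max} d \left[ (d-1) \tanh J_{\max} \right]^{d(i,j)-1}.
\end{aligned}$$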

This lemma implies that, in the correlation decay regime , the Ising model has exponential correlation decay, i.e., the correlation between a pair of nodes decays exponentially with their distance. We say that a path of length is short if is above some desired threshold.

The girth of a graph is defined as the length of the shortest cycle in the graph, and large girth implies that there is no short cycle in the graph. When the Ising model is in the correlation decay regime and the girth of the graph is large in terms of the correlation decay parameters, there is at most one short path between any pair of non-neighbor nodes, and no short paths other than the direct edge between any pair of neighboring nodes. Naturally, we can use of size 1 to approximately separate any pair of non-neighbor nodes and do not need to block the other paths for neighbor nodes as the correlations are mostly due to the direct edges. Therefore, we would expect this Ising model to be -loosely connected for some constant . In fact, the following theorem gives an explicit characterization of . The condition on the girth below is chosen such that there is at most one short path between any pair of nodes, so a path is called short if it is shorter than half of the girth.

###### Theorem 3.5.

Assume and the girth satisfies

$$\beta \alpha^{g/2} \le A \wedge \ln 2,$$

where . Let . Then ,

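$$\min_{\substack{S \subset V \setminus \{i \cup j\} \\ |S| \le D_1}} \; \max_{x_i, x_j, x'_j, x_S} \left| P(x_i \mid x_j, x_S) - P(x_i \mid x'_j, x_S) \right| > \epsilon,$$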

and ,

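$$\min_{\substack{S \subset V \setminus \{i \cup j\} \\ |S| \le D_1}} \; \max_{x_i, x_j, x'_j, x_S} \left| P(x_i \mid x_j, x_S) - P(x_i \mid x'_j, x_S) \right| \le \frac{\epsilon}{4}.$$

###### Proof.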

See Appendix A. ∎

### 3.3 Erdős-Rényi Random Graph $G(p, c/p)$ and Correlation Decay

We assume the graph is generated from the prior in which each edge is in with probability and the average degree for each node is . For this random graph, the maximum degree scales as with high probability [1]. Thus, we cannot use the results for bounded degree graphs even though the average degree remains bounded as

It is known from prior work [2] that, for ferromagnetic Ising models, i.e., for any and , when is small compared with the average degree , the random graph is in the correlation decay regime and the number of short paths between any pair of nodes is at most 2 asymptotically. We show that the same result holds for general Ising models. Our proof is related to the techniques developed in [2], but certain steps in the proof of [2] rely on the fact that the Ising model is ferromagnetic, so the proof does not directly carry over. We point out similarities and differences as we proceed in Appendix C.

More specifically, letting for some , the following theorem shows that nodes that are at least hops from each other have negligible impact on each other. As a consequence of the following theorem, we can say that a path is short if it is at most hops.

###### Theorem 3.6.

Assume . Then, the following properties are true almost always.
(1) Let be a graph generated from the prior If are not neighbors in and separates all the paths shorter than hops between , then ,

$$\left| P(x_i \mid x_j, x_S) - P(x_i \mid x'_j, x_S) \right| \le |B(i, \gamma_p)| \, (\tanh J_{\max})^{\gamma_p} = o(p^{-\kappa}),$$

for all Ising models on where and is the set of all nodes which are at most hops away from .
(2) There are at most two paths shorter than between any pair of nodes.

###### Proof.

See Appendix C. ∎

The above result suggests that for Ising models on the random graph there are at most two short paths between non-neighbor nodes and one short non-direct path between neighboring nodes, i.e., it is a -loosely connected MRF. Further, the next two theorems prove that such a constant exists. The proofs are in Appendix C.

###### Theorem 3.7.

For any , let be a set separating the paths shorter than between and assume , then almost always

$$I(X_i; X_j \mid X_S) = o(p^{-2\kappa}).$$

###### Theorem 3.8.

For any , let be a set separating the non-direct paths shorter than between and assume , then almost always

$$I(X_i; X_j \mid X_T) = \Omega(1).$$

## 4 Our Algorithm and Concentration Results

Learning the structure of a graph is equivalent to learning if there exists an edge between every pair of nodes in the graph. Therefore, we would like to develop a test to determine if there exists an edge between two nodes or not. From Definition 3.2, it should be clear that learning a loosely connected MRF is straightforward. For non-neighbor nodes, we search for the set that separates all the short paths between them, while for neighboring nodes, we search for the set that separates all the non-direct short paths between them. As the MRF is loosely connected, the sizes of the above sets are small, and therefore the complexity of the algorithm is low.

Given i.i.d. samples from the distribution the empirical distribution is defined as follows. For any set ,

$$\hat{P}(x_A) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}\left\{ X_A^{(i)} = x_A \right\}.$$
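The empirical distribution above is simply a normalized count of how often each configuration of the selected coordinates appears in the samples. A minimal sketch follows, with the hypothetical function name `empirical_pmf` and samples represented as tuples:

```python
from collections import Counter

def empirical_pmf(samples, A):
    """Empirical pmf of the coordinates in A (a tuple of indices) from i.i.d. samples."""
    n = len(samples)
    counts = Counter(tuple(s[k] for k in A) for s in samples)
    return {x_A: count / n for x_A, count in counts.items()}

samples = [(0, 1), (0, 1), (1, 1), (1, 0)]
print(empirical_pmf(samples, (0,)))    # marginal of the first coordinate
print(empirical_pmf(samples, (0, 1)))  # joint of both coordinates
```

Configurations that never appear in the samples are omitted from the dictionary and implicitly have empirical probability zero.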

Let be the empirical conditional independence test which is the same as but computed using . Our first algorithm is as follows.

For clarity, when we specifically use the mutual information test (or the probability test), we denote the corresponding algorithm by (or ). When the empirical conditional independence test is close to the exact test , we immediately get the following result.

###### Fact 4.1.

For a -loosely connected MRF, if

$$\left| \hat{\Delta}(X_i; X_j \mid X_A) - \Delta(X_i; X_j \mid X_A) \right| < \frac{\epsilon}{4}$$

for any node and set with , then recovers the graph correctly. The running time for the algorithm is .

###### Proof.

The correctness is immediate. We note that, for each pair of in , we search in . So the possible combinations of is and we get the running time result. ∎
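The search described in this proof can be sketched as follows. This is an illustrative reconstruction, not the paper's algorithm listing: `delta_hat` is a hypothetical callable returning the empirical test value for a pair of nodes and a conditioning tuple, and the decision threshold `eps / 2` is an assumption chosen because it lies between the bounds implied by Definition 3.2 and an estimation error below `eps / 4`.

```python
import itertools

def subsets_up_to(nodes, k):
    """All subsets of `nodes` with at most k elements, as tuples."""
    for r in range(k + 1):
        yield from itertools.combinations(nodes, r)

def max_min_structure(num_nodes, delta_hat, D1, D2, eps):
    """Declare (i, j) an edge when min over S (|S| <= D1) of max over T (|T| <= D2)
    of delta_hat(i, j, S union T) exceeds eps / 2 (assumed threshold)."""
    edges = set()
    for i, j in itertools.combinations(range(num_nodes), 2):
        others = [k for k in range(num_nodes) if k not in (i, j)]
        score = min(
            max(delta_hat(i, j, tuple(sorted(set(S) | set(T))))
                for T in subsets_up_to(others, D2))
            for S in subsets_up_to(others, D1)
        )
        if score > eps / 2:
            edges.add((i, j))
    return edges

# Stub test value mimicking a triangle whose edges are only visible after conditioning.
def xor_delta(i, j, cond):
    return 0.4 if cond else 0.0

print(max_min_structure(3, xor_delta, D1=0, D2=1, eps=0.4))
```

With the stub above, the maximization over one conditioning node recovers all three edges, whereas setting `D2=0` would return no edges.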

When the MRF has correlation decay, it is possible to reduce the computational complexity by restricting the search space for the set and to a smaller candidate neighbor set. In fact, for each node , the nodes which are a certain distance away from have small correlation with . As suggested in [7], we can first perform a pairwise correlation test to eliminate these nodes from the candidate neighbor set of node . To make sure the true neighbors are all included in the candidate set, the MRF needs to satisfy an additional pairwise non-degeneracy condition. Our second algorithm is as follows.

The following result provides conditions under which the second algorithm correctly learns the MRF.

###### Fact 4.2.

For a -loosely connected MRF with

$$\max_{x_i, x_j, x'_j} \left| P(x_i \mid x_j) - P(x_i \mid x'_j) \right| > \epsilon' \tag{1}$$

for any , if

$$\left| \hat{P}(x_i \mid x_j) - P(x_i \mid x_j) \right| < \frac{\epsilon'}{8}$$

for any node and , and

$$\left| \hat{\Delta}(X_i; X_j \mid X_A) - \Delta(X_i; X_j \mid X_A) \right| < \frac{\epsilon}{4}$$

for any node and set with , then recovers the graph correctly. Let . The running time for the algorithm is .

###### Proof.

By the pairwise non-degeneracy condition (1), the neighbors of node are all included in the candidate neighbor set . We note that this preprocessing step excludes the nodes whose correlation with node is below . Then in the inner loop, the correctness of the algorithm is immediate. The running time of the correlation test is . We note that, for each in , we loop over in and search and in . So the possible combinations of is . Combining the two steps, we get the running time of the algorithm. ∎
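The two-stage procedure in this proof can be sketched as below. It is an illustration under stated assumptions rather than the paper's listing: `corr_hat` and `delta_hat` are hypothetical callables for the empirical pairwise correlation test and the empirical conditional independence test, and the thresholds `eps_prime / 2` and `eps / 2` are assumed values consistent with the error bounds in Fact 4.2.

```python
import itertools

def candidate_sets(num_nodes, corr_hat, eps_prime):
    """Correlation test: keep nodes whose empirical correlation with i is large enough."""
    return {
        i: [j for j in range(num_nodes) if j != i and corr_hat(i, j) > eps_prime / 2]
        for i in range(num_nodes)
    }

def subsets_up_to(nodes, k):
    """All subsets of `nodes` with at most k elements, as tuples."""
    for r in range(k + 1):
        yield from itertools.combinations(nodes, r)

def restricted_max_min(num_nodes, corr_hat, delta_hat, D1, D2, eps, eps_prime):
    """Max-min test with j, S and T all drawn from the candidate set of node i."""
    cand = candidate_sets(num_nodes, corr_hat, eps_prime)
    edges = set()
    for i in range(num_nodes):
        for j in cand[i]:
            pool = [k for k in cand[i] if k != j]
            score = min(
                max(delta_hat(i, j, tuple(sorted(set(S) | set(T))))
                    for T in subsets_up_to(pool, D2))
                for S in subsets_up_to(pool, D1)
            )
            if score > eps / 2:
                edges.add((min(i, j), max(i, j)))
    return edges
```

The outer search over nodes still costs time quadratic in the number of nodes for the correlation test, but the expensive subset enumeration is confined to the candidate set, which is the source of the savings described above.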

Note that the additional non-degeneracy condition (1) required for the second algorithm to execute correctly is not satisfied for all graphs (recall Example 1.1).
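The correlation pre-test can be sketched as below. The code computes the empirical version of the left-hand side of condition (1) for every pair and keeps, for each node, the nodes whose value exceeds a threshold. The threshold is left as a parameter because its exact value is an assumption here; the function names and the list-of-tuples sample format are hypothetical.

```python
from collections import Counter


def pairwise_gap(samples, i, j):
    """Empirical max over x_i, x_j, x_j' of |P(x_i | x_j) - P(x_i | x_j')|."""
    joint, marg_j = Counter(), Counter()
    for x in samples:
        joint[(x[i], x[j])] += 1
        marg_j[x[j]] += 1
    values_i = {a for a, _ in joint}
    gap = 0.0
    for a in values_i:
        # Conditional probabilities of x_i = a given each observed x_j.
        conds = [joint[(a, b)] / marg_j[b] for b in marg_j]
        gap = max(gap, max(conds) - min(conds))
    return gap


def candidate_neighbors(samples, p, threshold):
    """For each node i, keep the nodes j whose empirical gap exceeds threshold."""
    return {
        i: [j for j in range(p)
            if j != i and pairwise_gap(samples, i, j) > threshold]
        for i in range(p)
    }
```

The conditional independence tests are then run only over nodes in these candidate lists, which is where the saving in running time comes from.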

### 4.1 Concentration Results

In this subsection, we show a set of concentration results for the empirical quantities in the above algorithm for general discrete MRFs, which will be used to obtain the sample complexity results in Section 5 and Section 6.

###### Lemma 4.3.

Fix . Let . For ,

1. Assume . If

$$n>\frac{2\left[(2+\alpha)\log p+2\log|\mathcal{X}|\right]}{\gamma^2},$$

then ,

$$|\hat{P}(x_i|x_j)-P(x_i|x_j)|<4\gamma$$

with probability for some constant .

2. Assume for some constant , and . If

$$n>\frac{2\left[(1+\alpha)\log p+(D_1+D_2+1)\log L+(D_1+D_2+2)\log|\mathcal{X}|\right]}{\gamma^2},$$

then ,

$$|\hat{P}(x_i|x_j,x_S)-P(x_i|x_j,x_S)|<\frac{2\gamma}{\delta}$$

with probability for some constant .

3. Assume . If

$$n>\frac{2\left[(1+\alpha)\log p+(D_1+D_2+1)\log L+(D_1+D_2+2)\log|\mathcal{X}|\right]}{\gamma^2},$$

then ,

$$|\hat{I}(X_i;X_j|X_S)-I(X_i;X_j|X_S)|<8|\mathcal{X}|^{D_1+D_2+2}\sqrt{\gamma}$$

with probability for some constant .

###### Proof.

See Appendix D. ∎
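To get a feel for the sample-size thresholds in Lemma 4.3, the two bounds can be evaluated numerically. The sketch below simply transcribes the right-hand sides of the displayed inequalities; the function names are hypothetical, `alphabet_size` stands for $|\mathcal{X}|$, and `L` is passed in as a plain number because its definition is taken from the lemma rather than computed here.

```python
from math import floor, log


def n_pairwise(p, alpha, gamma, alphabet_size):
    """Smallest integer n with n > 2[(2+alpha) log p + 2 log|X|] / gamma^2."""
    bound = 2 * ((2 + alpha) * log(p) + 2 * log(alphabet_size)) / gamma ** 2
    return floor(bound) + 1


def n_conditional(p, alpha, gamma, alphabet_size, d1, d2, L):
    """Smallest integer n exceeding the bound in parts 2 and 3 of the lemma."""
    k = d1 + d2
    bound = 2 * ((1 + alpha) * log(p)
                 + (k + 1) * log(L)
                 + (k + 2) * log(alphabet_size)) / gamma ** 2
    return floor(bound) + 1


if __name__ == "__main__":
    # Squaring p roughly doubles the threshold: growth is logarithmic in p.
    for p in (100, 10_000, 100_000_000):
        print(p, n_pairwise(p, alpha=1.0, gamma=0.1, alphabet_size=2))
```

For fixed $\gamma$, $\alpha$, $D_1$, $D_2$, $L$ and alphabet size, both thresholds grow only logarithmically in $p$; the dependence on $\gamma$ is quadratic.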

This lemma could be used as a guideline on how to choose between the two conditional independence tests for our algorithm to get lower sample complexity. The key difference is the dependence on the constant , which is a lower bound on the probability of any with the set size . The probability test requires a constant to achieve sample complexity , while the mutual information test does not depend on and also achieves sample complexity . We note that, while both tests have sample complexity, the constants hidden in the order notation may be different for the two tests. For Ising models on bounded degree graphs, we show in the next section that a constant exists, and the probability test gives a lower sample complexity. On the other hand, for Ising models on the Erdős-Rényi random graph , we could not get a constant as the maximum degree of the graph is unbounded, and the mutual information test gives a lower sample complexity.
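The role of the lower bound on configuration probabilities is easiest to see in code for the probability test. The sketch below assumes the statistic is the empirical maximum of $|P(x_i|x_j,x_S)-P(x_i|x_j',x_S)|$ over all observed values; the function name and sample format are hypothetical. Each conditional probability is a ratio whose denominator is the count of one configuration $(x_j, x_S)$, so rare configurations give noisy ratios.

```python
from collections import Counter, defaultdict


def probability_test(samples, i, j, cond):
    """Empirical max over x_i, x_j, x_j', x_S of
    |P(x_i | x_j, x_S) - P(x_i | x_j', x_S)|, over observed configurations."""
    joint = Counter()  # counts of (x_S, x_j, x_i)
    given = Counter()  # counts of (x_S, x_j)
    for x in samples:
        s = tuple(x[k] for k in cond)
        joint[(s, x[j], x[i])] += 1
        given[(s, x[j])] += 1
    values_i = {a for (_, _, a) in joint}
    # Group the observed values of x_j by the conditioning configuration x_S.
    values_j_by_s = defaultdict(list)
    for (s, b) in given:
        values_j_by_s[s].append(b)
    gap = 0.0
    for s, bs in values_j_by_s.items():
        for a in values_i:
            # Each ratio divides by the count of one (x_S, x_j) configuration.
            probs = [joint[(s, b, a)] / given[(s, b)] for b in bs]
            gap = max(gap, max(probs) - min(probs))
    return gap
```

The mutual information estimate averages over configurations instead of taking a maximum of ratios, which is one way to read why its bound in Lemma 4.3 does not involve the probability lower bound.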

## 5 Computational Complexity for General Ising Models

In this section, we apply our algorithm to the Ising models in Section 3. We evaluate both the number of samples required to recover the graph with high probability and the running time of our algorithm. The results below are simple combinations of the results in the previous two sections. Unless otherwise stated, we assume that the edge coefficients are bounded, i.e., . Throughout this section, we use the notation to denote the minimum of and .

### 5.1 Bounded Degree Graph

We assume the graph has maximum degree . First we have the following lower bound on the probability of any finite size set of variables.

, .

###### Proof.

See Appendix A. ∎

Our algorithm with the probability test for the bounded degree graph case reproduces the algorithm in [7]. For completeness, we state the following result without a proof since it is nearly identical to the result in [7], except for some constants.

###### Corollary 5.2.

Let be defined as in Proposition 3.3. Define

 δ=2−2dexp(−12d2Jmax).

Let . If , the algorithm   recovers with probability for some constant . The running time of the algorithm is . ∎

### 5.2 Bounded Degree Graph, Correlation Decay and Large Girth

We assume the graph has maximum degree . We also assume that the Ising model is in the correlation decay regime, i.e., , and the graph has large girth. Combining Theorem 3.5, Fact 4.1 and Lemma 4.3, we can show that the algorithm recovers the graph correctly with high probability for some constant , and the running time is for .

We can get even lower computational complexity using our second algorithm. The key observation is that, as there is no short path other than the direct edge between neighboring nodes, the correlation over the edge dominates the total correlation, and hence the pairwise non-degeneracy condition is satisfied. We note that the length of the second shortest path between neighboring nodes is no less than .

###### Lemma 5.3.

Assume that , and the girth satisfies

$$\beta\alpha^{g-1}\le A\wedge\ln 2,$$

where . Let . , we have

$$\max_{x_i,x_j,x_j'}|P(x_i|x_j)-P(x_i|x_j')|>\epsilon'.$$

###### Proof.

See Appendix A. ∎

Using this lemma, we can apply our second algorithm to learn the graph. Using Lemma 3.4, if node is of distance hops from node , we have

$$\max_{x_i,x_j,x_j'}|P(x_i|x_j)-P(x_i|x_j')|<\beta\alpha^{l_{\epsilon'}}\le\frac{\epsilon'}{4}.$$

Therefore, in the correlation test, only includes nodes within distance from and the size since the maximum degree is ; i.e., , which is a constant independent of . Combining the previous lemma, Theorem 3.5, Fact 4.2 and Lemma 4.3, we get the following result.
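The claim that the candidate set has constant size can be checked on a toy example. The sketch below counts the nodes within a given number of hops of a source by breadth-first search; on a cycle (maximum degree 2) the count depends on the radius but not on the number of nodes. The helper names are hypothetical and the cycle is only an illustrative bounded-degree graph.

```python
from collections import deque


def nodes_within(adj, source, radius):
    """Return the set of nodes at distance 1..radius from `source` (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == radius:
            continue  # do not expand beyond the requested radius
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[source]
    return set(dist)


def cycle_graph(p):
    """Adjacency lists of a cycle on p nodes (every node has degree 2)."""
    return {i: [(i - 1) % p, (i + 1) % p] for i in range(p)}


if __name__ == "__main__":
    for p in (20, 200, 2000):
        print(p, len(nodes_within(cycle_graph(p), 0, 3)))
```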

###### Corollary 5.4.

Assume . Assume and satisfy Theorem 3.5 and Lemma 5.3. Let be defined as in Corollary 5.2. Let . If

$$n>\frac{2\left[(2+\alpha)\log p+2l_{\epsilon'}\log d+3\log 2\right]}{\gamma^2},$$

the algorithm recovers with probability for some constant . The running time of the algorithm is . ∎

### 5.3 Erdős-Rényi Random Graph G(p, c/p) and Correlation Decay

We assume the graph is generated from the prior in which each edge is in with probability and the average degree for each node is . Because the random graph has unbounded maximum degree, we cannot lower bound the probability of a finite size set of random variables by a constant for all . To get good sample complexity, we use the mutual information test in our algorithm. Combining Theorem 3.7, Theorem 3.8, Fact 4.1 and Lemma 4.3, we get the following result.

###### Corollary 5.5.

Assume . There exists a constant such that, for , if , the algorithm recovers the graph almost always. The running time of the algorithm is $O(np^5)$. ∎

### 5.4 Sample Complexity

In this subsection, we briefly summarize the number of samples required by our algorithm. According to the results in this section and the next section, $C\log p$ samples are sufficient in general, where the constant $C$ depends on the parameters of the model. When the Ising model is on a bounded degree graph with maximum degree , the constant is of order . In particular, if the Ising model is in the correlation decay regime, then and the constant is of order . When the Ising model is on an Erdős-Rényi random graph and is in the correlation decay regime, then the constant is lower bounded by some absolute constant independent of the model parameters.

## 6 Computational Complexity for Ferromagnetic Ising Models

Ferromagnetic Ising models are Ising models in which all the edge coefficients are nonnegative. We say is an edge if . One important property of ferromagnetic Ising models is association, which characterizes the positive dependence among the nodes.

###### Definition 6.1.

[9] We say a collection of random variables is associated, or the random vector is associated, if

$$\mathrm{Cov}(f(X),g(X))\ge 0$$

for all nondecreasing functions and for which exist. ∎

###### Proposition 6.2.

[12] The random vector of a ferromagnetic Ising model (possibly with external fields) is associated. ∎
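Association can be checked by brute force on a small example. The sketch below assumes the usual Ising parametrization $P(x)\propto\exp\big(\sum_{(i,j)}J_{ij}x_ix_j+\sum_i h_ix_i\big)$ with $x_i\in\{-1,+1\}$ (the paper's own definition lies outside this section), enumerates all configurations, and evaluates covariances of nondecreasing functions. The function names and the 4-cycle example are hypothetical.

```python
from itertools import product
from math import exp


def ising_distribution(p, J, h=None):
    """Exact distribution of a small Ising model by enumeration.

    J maps edges (i, j) to coefficients; h is an optional list of fields.
    """
    h = h or [0.0] * p
    weights = {}
    for x in product((-1, 1), repeat=p):
        exponent = sum(Jij * x[i] * x[j] for (i, j), Jij in J.items())
        exponent += sum(h[i] * x[i] for i in range(p))
        weights[x] = exp(exponent)
    z = sum(weights.values())
    return {x: w / z for x, w in weights.items()}


def covariance(dist, f, g):
    """Cov(f(X), g(X)) under the distribution `dist`."""
    ef = sum(pr * f(x) for x, pr in dist.items())
    eg = sum(pr * g(x) for x, pr in dist.items())
    efg = sum(pr * f(x) * g(x) for x, pr in dist.items())
    return efg - ef * eg


if __name__ == "__main__":
    # Ferromagnetic 4-cycle: all edge coefficients are positive.
    J = {(0, 1): 0.5, (1, 2): 0.5, (2, 3): 0.5, (0, 3): 0.5}
    dist = ising_distribution(4, J)
    # Coordinate functions are nondecreasing, so each covariance should be >= 0.
    print(covariance(dist, lambda x: x[0], lambda x: x[2]))
```

Coordinate functions, `max` and `sum` are all nondecreasing in each argument, so they are valid choices of $f$ and $g$ in Definition 6.1.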

A useful consequence of the Ising model being associated is as follows.

###### Corollary 6.3.

Assume is a zero field ferromagnetic Ising model. For any , .

###### Proof.

See Appendix B. ∎

Informally speaking, the edge coefficient means that and are positively dependent over the edge. For any path between , as all the edge coefficients are positive, the dependence over the path is also positive. Therefore, the non-direct paths between a pair of neighboring nodes make and , which are positively dependent over the edge , even more positively dependent. This observation has two important implications for our algorithm.

1. We do not need to break the short cycles with a set in order to detect the edges, so the maximization in the algorithm can be removed.

2. The pairwise non-degeneracy is always satisfied for some constant , so we can apply the correlation test to reduce the computational complexity.

### 6.1 Bounded Degree Graph

We assume the graph has maximum degree . We have the following non-degeneracy result for ferromagnetic Ising models.

###### Lemma 6.4.

and ,

 maxxi,xj,x′j|P(xi|xj,xS)−P(xi|x′j,xS)|≥ 116(1−e−