# Interaction Screening: Efficient and Sample-Optimal Learning of Ising Models

We consider the problem of learning the underlying graph of an unknown Ising model on p spins from a collection of i.i.d. samples generated from the model. We suggest a new estimator that is computationally efficient and requires a number of samples that is near-optimal with respect to previously established information-theoretic lower-bound. Our statistical estimator has a physical interpretation in terms of "interaction screening". The estimator is consistent and is efficiently implemented using convex optimization. We prove that with appropriate regularization, the estimator recovers the underlying graph using a number of samples that is logarithmic in the system size p and exponential in the maximum coupling-intensity and maximum node-degree.

## Authors

• 12 publications
• 9 publications
• 13 publications
• 33 publications
03/13/2019

### The Log-Concave Maximum Likelihood Estimator is Optimal in High Dimensions

We study the problem of learning a d-dimensional log-concave distributio...
02/12/2018

### Region Detection in Markov Random Fields: Gaussian Case

In this work we consider the problem of model selection in Gaussian Mark...
03/31/2020

### Information-Theoretic Lower Bounds for Zero-Order Stochastic Gradient Estimation

In this paper we analyze the necessary number of samples to estimate the...
06/21/2020

### Learning of Discrete Graphical Models with Neural Networks

Graphical models are widely used in science to represent joint probabili...
10/09/2018

### Doubly Reparameterized Gradient Estimators for Monte Carlo Objectives

Deep latent variable models have become a popular model choice due to th...
09/20/2017

### Property Testing in High Dimensional Ising models

This paper explores the information-theoretic limitations of graph prope...
05/18/2020

### Optimal Representative Sample Weighting

We consider the problem of assigning weights to a set of samples or data...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A Graphical Model (GM) describes a probability distribution over a set of random variables which factorizes over the edges of a graph. It is of interest to recover the structure of GMs from random samples. The graphical structure contains valuable information on the dependencies between the random variables. In fact, the neighborhood of a random variable is the minimal set that provides us maximum information about this variable. Unsurprisingly, GM reconstruction plays an important role in various fields such as the study of gene expression

[1], protein interactions [2], neuroscience [3], image processing [4], sociology [5] and even grid science [6, 7].

The origin of the GM reconstruction problem is traced back to the seminal 1968 paper by Chow and Liu [8], where the problem was posed and resolved for the special case of tree-structured GMs. In this special tree case the maximum likelihood estimator is tractable and is tantamount to finding a maximum weighted spanning-tree. However, it is also known that in the case of general graphs with cycles, maximum likelihood estimators are intractable as they require computation of the partition function of the underlying GM, with notable exceptions of the Gaussian GM, see for instance [9], and some other special cases, like planar Ising models without magnetic field [10].

A lot of efforts in this field has focused on learning Ising models, which are the most general GMs over binary variables with pairwise interaction/factorization. Early attempts to learn the Ising model structure efficiently were heuristic, based on various mean-field approximations, e.g. utilizing empirical correlation matrices

[11, 12, 13, 14]. These methods were satisfactory in cases when correlations decrease with graph distance. However it was also noticed that the mean-field methods perform poorly for the Ising models with long-range correlations. This observation is not surprising in light of recent results stating that learning the structure of Ising models using only their correlation matrix is, in general, computationally intractable [15, 16].

Among methods that do not rely solely on correlation matrices but take advantage of higher-order correlations that can be estimated from samples, we mention the approach based on sparsistency of the so-called regularized pseudo-likelihood estimator [17]. This estimator, like the one we propose in this paper, is from the class of M-estimators i.e. estimators that are the minimum of a sum of functions over the sampled data [18]. The regularized pseudo-likelihood estimator is regarded as a surrogate for the intractable likelihood estimator with an additive -norm penalty to encourage sparsity of the reconstructed graph. The sparsistency-based estimator offers guarantees for the structure reconstruction, but the result only applies to GMs that satisfy a certain condition that is rather restrictive and hard to verify. It was also proven that the sparsity pattern of the regularized pseudo-likelihood estimator fails to reconstruct the structure of graphs with long-range correlations, even for simple test cases [19].

Principal tractability of structure reconstruction of an arbitrary Ising model from samples was proven only very recently. Bresler, Mossel and Sly in [20] suggested an algorithm which reconstructs the graph without errors in polynomial time. They showed that the algorithm requires number of samples that is logarithmic in the number of variables. Although this algorithm is of a polynomial complexity, it relies on an exhaustive neighborhood search, and the degree of the polynomial is equal to the maximal node degree.

Prior to the work reported in this manuscript the best known procedure for perfect reconstruction of an Ising model was through a greedy algorithm proposed by Bresler in [21]. Bresler’s algorithm is based on the observation that the mutual information between neighboring nodes in an Ising model is lower bounded. This observation allows to reconstruct the Ising graph perfectly with only a logarithmic number of samples and in time quasi-quadratic in the number of variables. On the other hand, Bresler’s algorithm suffers from two major practical limitations. First, the number of samples, hence the running time as well, scales double exponentially with respect to the largest node degree and with respect to the largest coupling intensity between pairs of variables. This scaling is rather far from the information-theoretic lower-bound reported in [22] predicting instead a single exponential dependency on the two aforementioned quantities. Second, Bresler’s algorithm requires prior information on the maximum and minimum coupling intensities as well as on the maximum node degree, guarantees which, in reality, are not necessarily available.

In this paper we propose a novel estimator for the graph structure of an arbitrary Ising model which achieves perfect reconstruction in quasi-quartic time (although we believe it can be provably reduced to quasi-quadratic time) and with a number of samples logarithmic in the system size. The algorithm is near-optimal in the sense that the number of samples required to achieve perfect reconstruction, and the run time, scale exponentially with respect to the maximum node-degree and the maximum coupling intensity, thus matching parametrically the information-theoretic lower bound of [22]. Our statistical estimator has the structure of a consistent M-estimator implemented via convex optimization with an additional thresholding procedure. Moreover it allows intuitive interpretation in terms of what we coin the “interaction screening”. We show that with a proper -regularization our estimator reconstructs couplings of an Ising model from a number of samples that is near-optimal. In addition, our estimator does not rely on prior information on the model characteristics, such as maximum coupling intensity and maximum degree.

The rest of the paper is organized as follows. In Section 2 we give a precise definition of the structure estimation problem for the Ising models and we describe in detail our method for structure reconstruction within the family of Ising models. The main results related to the reconstruction guarantees are provided by Theorem 1 and Theorem 2. In Section 3 we explain the strategy and the sequence of steps that we use to prove our main theorems. Proofs of Theorem 1 and Theorem 2 are summarized at the end of this Section. Section 4 illustrates performance of our reconstruction algorithm via simulations. Here we show on a number of test cases that the sample complexity of the suggested method scales logarithmically with the number of variables and exponentially with the maximum coupling intensity. In Section 5 we discuss possible generalizations of the algorithm and future work.

## 2 Main Results

Consider a graph with vertexes where is the vertex set and is the undirected edge set. Vertexes are associated with binary random variables that are called spins. Edges are associated with non-zero real parameters that are called couplings. An Ising model is a probability distribution over spin configurations that reads as follows:

 μ(σ––)=1Zexp⎛⎝∑(i,j)∈Eθ∗ijσiσj⎞⎠, (1)

where is a normalization factor called the partition function.

 Z=∑σ––exp⎛⎝∑(i,j)∈Eθ∗ijσiσj⎞⎠. (2)

Notice that even though the main innovation of this paper – the efficient “interaction screening” estimator – can be constructed for the most general Ising models, we restrict our attention in this paper to the special case of the Ising models with zero local magnetic-field. This simplification is not necessary and is done solely to simplify (generally rather bulky) algebra. Later in the text we will thus refer to the zero magnetic field model (2) simply as the Ising model.

### 2.1 Structure-Learning of Ising Models

Suppose that sequences/samples of spins are observed. Let us assume that each observed spin configuration is i.i.d. from (1). Based on these measurements/samples we aim to construct an estimator of the edge set that reconstructs the structure exactly with high probability, i.e.

 P[ˆE=E]=1−ϵ, (3)

where is a prescribed reconstruction error.

We are interested to learn structures of Ising models in the high-dimensional regime where the number of observations/samples is of the order . A necessary condition on the number of samples is given in [22, Thm. 1]. This condition depends explicitly on the smallest and largest coupling intensity

 α:=min(i,j)∈E|θ∗ij|,β:=max(i,j)∈E|θ∗ij|, (4)

and on the maximal node degree

 d:=maxi∈V|∂i|, (5)

where the set of neighbors of a node is denoted by .

According to [22], in order to reconstruct the structure of the Ising model with minimum coupling intensity , maximum coupling intensity , and maximum degree , the required number of samples should be at least

 n≥max⎛⎜ ⎜⎝eβdln(pd4−1)4dαeα,lnp2αtanhα⎞⎟ ⎟⎠. (6)

We see from Eq. (6) that the exponential dependence on the degree and the maximum coupling intensity are both unavoidable. Moreover, when the minimal coupling is small, the number of samples should scale at least as .

It remains unknown if the inequality (6) is achievable. It is shown in [22, Thm. 3] that there exists a reconstruction algorithm with error probability if the number of samples is greater than

 n≥(βd(3e2βd+1)sinh2(α/4))2(16logp+4ln(2/ϵ)). (7)

Unfortunately, the existence proof presented in [22] is based on an exhaustive search with the intractable maximum likelihood estimator and thus it does not guarantee actual existence of an algorithm with low computational complexity. Notice also that the number of samples in (7) scales as when and are asymptotically large and as when is asymptotically small.

### 2.2 Regularized Interaction Screening Estimator

The main contribution of this paper consists in presenting explicitly a structure-learning algorithm that is of low complexity and which is near-optimal with respect to bounds (6) and (7). Our algorithm reconstructs the structure of the Ising model exactly, as stated in Eq. (3), with an error probability , and with a number of samples which is at most proportional to and . (See Theorem 1 and Theorem 2 below for mathematically accurate statements.) Our algorithm consists of two steps. First, we estimate couplings in the vicinity of every node. Then, on the second step, we threshold the estimated couplings that are sufficiently small to zero. Resulting zero coupling means that the corresponding edge is not present.

Denote the set of couplings around node

by the vector

. In this, slightly abusive notation, we use the convention that if a coupling is equal to zero it reads as absence of the edge, i.e. if and only if . Note that if the node degree is bounded by , it implies that the vector of couplings is non-zero in at most entries.

Our estimator for couplings around node

is based on the following loss function coined the Interaction Screening Objective (ISO):

 Sn(θ–u)=1nn∑k=1exp⎛⎝−∑i∈V∖uθuiσ(k)uσ(k)i⎞⎠. (8)

The ISO is an empirical weighted-average and its gradient is the vector of weighted pair-correlations involving . At the exponential weight cancels exactly with the corresponding factor in the distribution (1). As a result, weighted pair-correlations involving vanish as if was uncorrelated with any other spins or completely “screened” from them, which explains our choice for the name of the loss function. This remarkable “screening” feature of the ISO suggests the following choice of the Regularized Interaction Screening Estimator (RISE) for the interaction vector around node :

 ˆθ––u(λ)=\operatornamewithlimitsargminθ–u∈Rp−1Sn(θ–u)+λ∥θ–u∥1, (9)

where is a tunable parameter promoting sparsity through the additive -penalty. Notice that the ISO is the empirical average of an exponential function of which implies it is convex. Moreover, addition of the -penalty preserves the convexity of the minimization objective in Eq. (9).

As expected, the performance of RISE does depend on the choice of the penalty parameter . If is too small is too sensitive to statistical fluctuations. On the other hand, if is too large has too much of a bias towards zero. In general, the optimal value of is hard to guess. Luckily, the following theorem provides strong guarantees on the square error for the case when is chosen to be sufficiently large.

###### Theorem 1 (Square Error of RISE).

Let be realizations of spins drawn i.i.d. from an Ising model with maximum degree and maximum coupling intensity . Then for any node and for any , the square error of the Regularized Interaction Screening Estimator (9) with penalty parameter is bounded with probability at least by

 ∥∥ˆθ––u(λ)−θ–∗u∥∥2≤28√d(d+1)e3βd ⎷ln3pϵ1n, (10)

whenever .

Our structure estimator (for the second step of the algorithm), Structure-RISE, takes RISE output and thresholds couplings whose absolute value is less than to zero:

 ˆE(λ,α)={(i,j)∈V×V∣ˆθij(λ)+ˆθji(λ)≥α}. (11)

Performance of the Structure-RISE is fully quantified by the following Theorem.

###### Theorem 2 (Structure Learning of Ising Models).

Let be realizations of spins drawn i.i.d. from an Ising model with maximum degree , maximum coupling intensity and minimal coupling intensity . Then for any , Structure-RISE with penalty parameter reconstructs the edge-set perfectly with probability

 P(ˆE(λ,α)=E)≥1−ϵ2, (12)

whenever .

Proofs of Theorem 1 and Theorem 2 are given in Subsection 3.3.

Theorem 1 states that RISE recovers not only the structure but also the correct value of the couplings up to an error based on the available samples. It is possible to improve the square-error bound (10) even further by first, running Structure-RISE to recover edges, and then re-running RISE with for the remaining non-zero couplings.

The computational complexity of RISE is equal to the complexity of minimizing the convex ISO and, as such, it scales at most as . Therefore, computational complexity of Structure-RISE scales at most as simply because one has to call RISE at every node. We believe that this running-time estimate can be proven to be quasi-quadratic when using first-order minimization-techniques, in the spirit of [23]. We have observed through numerical experiments that such techniques implement Structure-RISE with running-time .

Notice that in order to implement RISE there is no need for prior knowledge on the graph parameters. This is a considerable advantage in practical applications where the maximum degree or bounds on couplings are often unknown.

## 3 Analysis

The Regularized Interaction Screening Estimator (9) is from the class of the so-called regularized M-estimators. Negahban et al. proposed in [18] a framework to analyze the square error of such estimators. As per [18], enforcing only two conditions on the loss function is sufficient to get a handle on the square error of an -regularized M-estimator.

The first condition links the choice of the penalty parameter to the gradient of the objective function.

###### Condition 1.

The -penalty parameter strongly enforces regularization if it is greater than any partial derivatives of the objective function at , i.e.

 λ≥2∥∇Sn(θ∗u)∥∞. (13)

Condition 1 guarantees that if the vector of couplings has at most non-zero entries, then the estimation difference lies within the set

 K:={Δ∈Rp−1∣∥Δ∥1≤4√d∥Δ∥2}. (14)

The second condition ensure that the objective function is strongly convex in a restricted subset of . Denote the reminder of the first-order Taylor expansion of the objective function by

 δSn(Δu,θ∗u):=Sn(θ∗u+Δu)−Sn(θ∗u)−⟨∇Sn(θ∗u),Δu⟩, (15)

where is an arbitrary vector. Then the second condition reads as follows.

###### Condition 2.

The objective function is restricted strongly convex with respect to on a ball of radius centered at , if for all such that , there exists a constant such that

 δSn(Δu,θ∗u)≥κ∥Δu∥22. (16)

Strong regularization and restricted strong convexity enables us to control that the minimizer of the full objective (9) lies in the vicinity of the sparse vector of parameters . The precise formulation is given in the proposition following from [18, Thm. 1].

###### Proposition 1.

If the -regularized M-estimator of the form (9) satisfies Condition 1 and Condition 2 with then the square-error is bounded by

 ∥∥ˆθ––u−θ–∗u∥∥2≤3√dλκ. (17)

Like the ISO (8), its gradient in any component is an empirical average

 ∂∂θulSn(θ–u)=1nn∑k=1X(k)ul(θ–u), (18)

where the random variables are i.i.d and they are related to the spin configurations according to

 Xul(θ–u)=−σuσlexp⎛⎝−∑i∈V∖uθuiσuσi⎞⎠. (19)

In order to prove that the ISO gradient concentrates we have to state few properties of the support, the mean and the variance of the random variables (

19), expressed in the following three Lemmas.

The first of the Lemmas states that at , the random variable has zero mean.

###### Lemma 1.

For any Ising model with spins and for all

 E[Xul(θ–∗u)]=0. (20)
###### Proof.

By direct computation, we find that

 E[Xul(θ–∗u)] = E[−σuσlexp(−∑i∈∂uθ∗uiσuσi)] (21) = −1Z∑σ––σuσlexp⎛⎝∑(i,j)∈Eθ∗ijσiσj−∑i∈∂uθ∗uiσuσi⎞⎠=0,

where in the last line we use the fact that the exponential terms involving cancel, implying that the sum over is zero. ∎

As a direct corollary of the Lemma 1, is always a minimum of the averaged ISO (8).

The second Lemma proves that at , the random variable has a variance equal to one.

###### Lemma 2.

For any Ising model with spins and for all

 E[Xul(θ–∗u)2]=1. (22)
###### Proof.

As a result of direct evaluation one derives

 E[Xul(θ–∗u)2] = E[exp(−2∑i∈∂uθ∗uiσuσi)] (23) = 1Z∑σ––exp⎛⎝∑(i,j)∈E,i,j≠uθ∗ijσiσj−∑i∈∂uθ∗uiσuσi⎞⎠ = 1Z∑σ––exp⎛⎝∑(i,j)∈E,i,j≠uθ∗ijσiσj+∑i∈∂uθ∗uiσuσi⎞⎠ = 1.

Notice that in the second line the first sum over edges (under the exponential) does not depend on . Furthermore, the first sum is invariant under the change of variables, , while the second sum changes sign. This transformation results in appearance of the partition function in the numerator. ∎

The next lemma states that at , the random variable has a bounded support.

###### Lemma 3.

For any Ising model with spins, with maximum degree and maximum coupling intensity , it is guaranteed that for all

 |Xul(θ–∗u)|≤exp(βd). (24)
###### Proof.

Observe that components of are smaller than and at most of them are non-zero. Recall that spins are binary, , which results in the following estimate

 |Xul(θ–∗u)| = ∣∣ ∣∣−σuσiexp(−∑i∈∂uθ∗uiσuσi)∣∣ ∣∣ (25) ≤ exp(−∑i∈∂uθ∗uiσuσi) ≤ exp(βd).

With Lemma 1, 2 and 3, and using Berstein’s inequality we are now in position to prove that every partial derivative of the ISO concentrates uniformly around zero as the number of samples grows.

###### Lemma 4.

For any Ising model with spins, with maximum degree and maximum coupling intensity . For any , if the number of observation satisfies , then the following bound holds with probability at least :

 ∥∇Sn(θ–∗u)∥∞≤2 ⎷ln2pϵ3n. (26)
###### Proof.

Let us first show that every term is individually bounded by the RHS of (26) with high-probability. We further use the union bound to prove that all components are uniformly bounded with high-probability. Utilizing Lemma 1, Lemma 2 and Lemma 3 we apply the Bernstein’s Inequality

 (27)

Inverting the following relation

 s=12t2n1+13exp(βd)t, (28)

and substituting the result in the Eq. (27) one derives

 (29)

where

For , we can simplify Eq. (29) to have an expression independent of and

 (30)

Using and the union bound on every component of the gradient leads to the desired result. ∎

### 3.2 Restricted Strong-Convexity

The remainder of the first-order Taylor-expansion of the ISO, defined in Eq. (15) is explicitly computed

 δSn(Δu,θ∗)=1nn∑k=1exp(−∑i∈∂uθ∗uiσ(k)uσ(k)i)f⎛⎝∑i∈V∖uΔuiσ(k)uσ(k)i⎞⎠, (31)

where the function appearing in Eq. (31) reads

 f(z):=e−z−1+z. (32)

In the following lemma we prove that Eq. (31) is controlled by a much simpler expression using a lower-bound on Eq. (32).

###### Lemma 5.

For all , the remainder of the first-order Taylor expansion admits the following lower-bound

 δSn(Δu,θ∗) ≥e−βd2+∥Δu∥1Δ⊤uHnΔu (33)

where the matrix is an empirical covariance matrix with elements

 Hnij=1nn∑k=1σ(k)iσ(k)j. (34)
###### Proof.

We start to prove a lower-bound on the function valid for all ,

 f(z)≥z22+|z|. (35)

To see this, define an auxiliary function as follows

 g(z): =(2+|z|)f(z)−z2 =(2+|z|)(e−z−1+z)−z2. (36)

We show that achieves its minimum at Observe that the first derivative of vanishes at zero from both the negative and positive side

 limz→0+ddzg(z) =limz→0−ddzg(z)=0. (37)

Moreover for all the second derivative of is non-negative

 d2dz2g(z)=ze−z>0. (38)

A similar result holds for

 d2dz2g(z)=4(e−z−1)−ze−z>0, (39)

proving that for all , .

Combining Eq. (35) with the straightforward inequalities

 ∣∣ ∣∣∑i∈V∖uΔuiσ(k)uσ(k)i∣∣ ∣∣≤∥Δu∥1, (40)

and

 exp(−∑i∈∂uθ∗uiσ(k)uσ(k)i)≥exp(−βd), (41)

leads us to the following lower-bound on the remainder of the first-order Taylor expansion of the ISO

 δSn(Δu,θ∗) ≥e−βd2+∥Δu∥11nn∑k=1⎛⎝∑i∈V∖uΔuiσ(k)uσ(k)i⎞⎠2 =e−βd2+∥Δu∥1Δ⊤uHnΔu, (42)

where in the last line we used the trivial identity . ∎

Lemma 5 enables us to control the randomness in through the simpler matrix that is independent of . This last point is crucial as we show in the next lemma that concentrates independently of towards its mean.

###### Lemma 6.

Consider an Ising model with spins, with maximum degree and maximum coupling intensity . Let , and . Then with probability greater than , we have for all

 ∣∣Hnij−Hij∣∣≤δ, (43)

where the matrix is the covariance matrix with elements

 Hij=E[σiσj]. (44)
###### Proof.

We recall that the matrix elements of the empirical covariance matrix read

 Hnij=1nn∑k=1σ(k)iσ(k)j. (45)

Since using Hoeffding’s inequality, we have

 (46)

As is symmetric we use the union bound over the elements to get

 P[∣∣Hnij−Hij∣∣≥δ∀i,j∈V∖u]≤1−p2exp(−nδ22). (47)

The last ingredient that we need is a proof that the smallest eigenvalue of the covariance matrix

is bounded away from zero independently of the dimension . Equivalently the next lemma shows that the quadratic form associated with is non-degenerate regardless of the value of .

###### Lemma 7.

Consider an Ising model with spins, with maximum degree and maximum coupling intensity . For all the following bound holds

 Δ⊤uHΔu≥e−2βdd+1∥Δu∥22. (48)
###### Proof.

Our proof strategy here follows [16, Cor. 3.1]. Notice that the probability measure of the Ising model is symmetric with respect to the sign flip, i.e. . Thus any spin has zero mean, which implies that for every

 E⎡⎣⎛⎝∑i∈V∖uΔuiσi⎞⎠⎤⎦ = 0 (49)

This allows to reinterpret the left-hand side of Eq. (48) as a variance, using that

 Δ⊤uHΔu = ∑i,j∈V∖uΔuiE[σiσj]Δuj (50) = E⎡⎢⎣⎛⎝∑i∈V∖uΔuiσi⎞⎠2⎤⎥⎦ = Var⎡⎣∑i∈V∖uΔuiσi⎤⎦.

Construct a subset recursively as follows: (i) let and define , (ii) given , let and and set , (iii) terminate when and declare .

The set possesses the following two main properties. First, every node does not have any neighbors in and, second,

 (d+1)∑i∈AΔ2ui≥∑i∈V∖uΔ2ui. (51)

We apply the law of total variance to (50) by conditioning on the set of spins with indexes belonging to the complementary set ,

 Var⎡⎣∑i∈V∖uΔuiσi⎤⎦ ≥ E⎡⎣Var⎡⎣∑i∈V∖uΔuiσi∣∣ ∣∣σ––Ac⎤⎦⎤⎦ (52) = ∑i∈AΔ2uiE[Var[σi∣σ––Ac]],

where in the last line one uses that the spins in are conditionally independent given their neighbors . One concludes the proof by using relation (51) and the fact that the conditional variance of a spin given its neighbors is bounded from below:

 Var[σi∣σ––Ac] = 1−tanh2⎛⎝∑j∈∂iθ∗ijσj⎞⎠ (53) ≥ exp(−2βd).

We stress that Lemma 7 is a deterministic result valid for all . We are now in position to prove the restricted strong convexity of the ISO.

###### Lemma 8.

Consider an Ising model with spins, with maximum degree and maximum coupling intensity . For all and , when the ISO (8) satisfies, with probability at least , the restricted strong convexity condition

 δSn(Δu,θ∗u)≥e−3βd4(d+1)(1+2√dR)∥Δu∥22, (54)

for all such that and .

###### Proof.

First we apply Lemma 5 to get the quadratic bound

 δSn(Δu,θ∗) ≥e−βd2+∥Δu∥1Δ⊤uHnΔu ≥e−βd2(1+2√dR)Δ⊤uHnΔu. (55)

Second we use Lemma 7 to bound the quadratic form

 Δ⊤uHnΔu =Δ⊤uHΔu+Δ⊤u(Hn−H)Δu ≥e−2βdd+1∥Δu∥22+Δ⊤u(Hn−H)Δu. (56)

Third we conclude with Lemma 6, controlling randomness independently of . Choosing , we get with probability at least that

 Δ⊤u(Hn−H)Δu ≥−e−2βd32d(d+1)∥Δu∥21 ≥−e−2βd2(d+1)∥Δu∥22, (57)

whenever . ∎

### 3.3 Proof of the main Theorems

###### Proof of Theorem 1 (Square Error of RISE).

We seek to apply Proposition 1 to the Regularized Interaction Screening Estimator (9). Using in Lemma 4 and letting , it follows that Condition 1 is satisfied with probability greater than , whenever .

Using in Lemma 8, and observing that for and , we conclude that condition 2 is satisfied with probability greater than . Theorem 1 then follows by using a union bound and then applying Proposition 1. ∎

The proof of Theorem 2 becomes an immediate application of Theorem 1.

###### Proof of Theorem 2 (Structure Learning of Ising Models).

According to Theorem 1, one observes that, with probability , the minimal amount of samples required to achieve an error of on every coupling around a single node is

 n≥max(d/16,α−2)218d(d+1)2e6βdln3p2ϵ1. (58)

Let us choose and use the union-bound to ensure that the couplings at every node (thresholded by ) are simultaneously recovered with probability greater than . ∎

## 4 Numerical Results

We test performance of the Struct-RISE, with the strength of the -regularization parametrized by , on Ising models over two-dimensional grid with periodic boundary conditions (thus degree of every node in the graph is ). We have observed that this topology is one of the hardest for the reconstruction problem. We are interested to find the minimal number of samples, , such that the graph is perfectly reconstructed with probability . In our numerical experiments, we recover the value of as the minimal for which Struct-RISE outputs the perfect structure 45 times from 45 different trials with samples, thus guaranteeing that the probability of perfect reconstruction is greater than with a statistical confidence of at least .

We first verify the logarithmic scaling of with respect to the number of spins . The couplings are chosen uniform and positive . This choice ensures that samples generated by Glauber dynamics are i.i.d. according to (1). Values of for are shown on the left in Figure 1. Empirical scaling is, , which is orders of magnitude better than the rather conservative prediction of the theory for this model, .

We also test the exponential scaling of with respect to the maximum coupling intensity . The test is conducted over two different settings both with spins: the ferromagnetic case where all couplings are uniform and positive, and the spin glass case where the sign of couplings is assigned uniformly at random. In both cases the absolute value of the couplings, , is uniform and equal to . To ensure that the samples are i.i.d, we sample directly from the exhaustive weighted list of the possible spin configurations. The structure is recovered by thresholding the reconstructed couplings at the value .

Experimental values of for different values of the maximum coupling intensity, , are shown on the right in Fig. 1. Empirically observed exponential dependence on is matched best by, , in the ferromagnetic case and by, , in the case of the spin glass. Theoretical bound for predicts . We observe that the difference in sample complexity depends significantly on the type of interaction. An interesting observation one can make based on these experiments is that the case which is harder from the sample-generating perspective is easier for learning and vice versa.

## 5 Conclusions and Path Forward

In this paper we construct and analyze the Regularized Interaction Screening Estimator (9). We show that the estimator is computationally efficient and needs an optimal number of samples for learning Ising models. The RISE estimator does not require any prior knowledge about the model parameters for implementation and it is based on the minimization of the loss function (8), that we call the Interaction Screening Objective. The ISO is an empirical average (over samples) of an objective designed to screen an individual spin/variable from its factor-graph neighbors.

Even though we focus in this paper solely on learning pair-wise binary models, the “interaction screening” approach we introduce here is generic. The approach extends to learning other Graphical Models, including those over higher (discrete, continuous or mixed) alphabets and involving high-order (beyond pair-wise) interactions. These generalizations are built around the same basic idea pioneered in this paper – the interaction screening objective is (a) minimized over candidate GM parameters at the actual values of the parameters we aim to learn; and (b) it is an empirical average over samples. In the future, we plan to explore further theoretical and experimental power, characteristics and performance of the generalized screening estimator.

## Acknowledgment

We are thankful to Guy Bresler and Andrea Montanari for valuable discussions, comments and insights. The work was supported by funding from the U.S. Department of Energy’s Office of Electricity as part of the DOE Grid Modernization Initiative.

## References

• [1] D. Marbach, J. C. Costello, R. Kuffner, N. M. Vega, R. J. Prill, D. M. Camacho, K. R. Allison, M. Kellis, J. J. Collins, and G. Stolovitzky, “Wisdom of crowds for robust gene network inference,” Nat Meth, vol. 9, pp. 796–804, Aug 2012.
• [2] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, “Direct-coupling analysis of residue coevolution captures native contacts across many protein families,” Proceedings of the National Academy of Sciences, vol. 108, no. 49, pp. E1293–E1301, 2011.
• [3] E. Schneidman, M. J. Berry, R. Segev, and W. Bialek, “Weak pairwise correlations imply strongly correlated network states in a neural population,” Nature, vol. 440, pp. 1007–1012, Apr 2006.
• [4] S. Roth and M. J. Black, “Fields of experts: a framework for learning image priors,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2, pp. 860–867 vol. 2, June 2005.
• [5] N. Eagle, A. S. Pentland, and D. Lazer, “Inferring friendship network structure by using mobile phone data,” Proceedings of the National Academy of Sciences, vol. 106, no. 36, pp. 15274–15278, 2009.
• [6] M. He and J. Zhang, “A dependency graph approach for fault detection and localization towards secure smart grid,” IEEE Transactions on Smart Grid, vol. 2, pp. 342–351, June 2011.
• [7] D. Deka, S. Backhaus, and M. Chertkov, “Structure learning and statistical estimation in distribution networks,” submitted to IEEE Control of Networks; arXiv:1501.04131; arXiv:1502.07820, 2015.
• [8] C. Chow and C. Liu, “Approximating discrete probability distributions with dependence trees,” IEEE Transactions on Information Theory, vol. 14, pp. 462–467, May 1968.
• [9] A. d’Aspremont, O. Banerjee, and L. E. Ghaoui, “First-order methods for sparse covariance selection,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 56–66, 2008.
• [10] J. K. Johnson, D. Oyen, M. Chertkov, and P. Netrapalli, “Learning planar ising models,” Journal of Machine Learning, in press; arXiv:1502.00916, 2015.
• [11]

T. Tanaka, “Mean-field theory of Boltzmann machine learning,”

Phys. Rev. E, vol. 58, pp. 2302–2310, Aug 1998.
• [12] H. J. Kappen and F. d. B. Rodríguez, “Efficient learning in Boltzmann machines using linear response theory,” Neural Computation, vol. 10, no. 5, pp. 1137–1156, 1998.
• [13] Y. Roudi, J. Tyrcha, and J. Hertz, “Ising model for neural data: Model quality and approximate methods for extracting functional connectivity,” Phys. Rev. E, vol. 79, p. 051915, May 2009.
• [14] F. Ricci-Tersenghi, “The Bethe approximation for solving the inverse Ising problem: a comparison with other inference methods,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2012, no. 08, p. P08015, 2012.
• [15] G. Bresler, D. Gamarnik, and D. Shah, “Hardness of parameter estimation in graphical models,” in Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds.), pp. 1062–1070, Curran Associates, Inc., 2014.
• [16] A. Montanari, “Computational implications of reducing data to sufficient statistics,” Electron. J. Statist., vol. 9, no. 2, pp. 2370–2390, 2015.
• [17] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, “High-dimensional Ising model selection using

1-regularized logistic regression,”

Ann. Statist., vol. 38, pp. 1287–1319, 06 2010.
• [18] S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu, “A unified framework for high-dimensional analysis of -estimators with decomposable regularizers,” Statist. Sci., vol. 27, pp. 538–557, 11 2012.
• [19] A. Montanari and J. A. Pereira, “Which graphical models are difficult to learn?,” in Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, eds.), pp. 1303–1311, Curran Associates, Inc., 2009.
• [20] G. Bresler, E. Mossel, and A. Sly, “Reconstruction of Markov random fields from samples: Some observations and algorithms,” SIAM Journal on Computing, vol. 42, no. 2, pp. 563–578, 2013.
• [21] G. Bresler, “Efficiently learning Ising models on arbitrary graphs,” in

Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing

, pp. 771–782, ACM, 2015.
• [22] N. P. Santhanam and M. J. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions,” IEEE Transactions on Information Theory, vol. 58, pp. 4117–4134, July 2012.
• [23]

A. Agarwal, S. Negahban, and M. J. Wainwright, “Fast global convergence of gradient methods for high-dimensional statistical recovery,”

Ann. Statist., vol. 40, pp. 2452–2482, 10 2012.