A Graphical Model (GM) describes a probability distribution over a set of random variables which factorizes over the edges of a graph. It is of interest to recover the structure of GMs from random samples. The graphical structure contains valuable information on the dependencies between the random variables. In fact, the neighborhood of a random variable is the minimal set that provides us maximum information about this variable. Unsurprisingly, GM reconstruction plays an important role in various fields such as the study of gene expression, protein interactions , neuroscience , image processing , sociology  and even grid science [6, 7].
The origin of the GM reconstruction problem is traced back to the seminal 1968 paper by Chow and Liu , where the problem was posed and resolved for the special case of tree-structured GMs. In this special tree case the maximum likelihood estimator is tractable and is tantamount to finding a maximum weighted spanning-tree. However, it is also known that in the case of general graphs with cycles, maximum likelihood estimators are intractable as they require computation of the partition function of the underlying GM, with notable exceptions of the Gaussian GM, see for instance , and some other special cases, like planar Ising models without magnetic field .
A lot of efforts in this field has focused on learning Ising models, which are the most general GMs over binary variables with pairwise interaction/factorization. Early attempts to learn the Ising model structure efficiently were heuristic, based on various mean-field approximations, e.g. utilizing empirical correlation matrices[11, 12, 13, 14]. These methods were satisfactory in cases when correlations decrease with graph distance. However it was also noticed that the mean-field methods perform poorly for the Ising models with long-range correlations. This observation is not surprising in light of recent results stating that learning the structure of Ising models using only their correlation matrix is, in general, computationally intractable [15, 16].
Among methods that do not rely solely on correlation matrices but take advantage of higher-order correlations that can be estimated from samples, we mention the approach based on sparsistency of the so-called regularized pseudo-likelihood estimator . This estimator, like the one we propose in this paper, is from the class of M-estimators i.e. estimators that are the minimum of a sum of functions over the sampled data . The regularized pseudo-likelihood estimator is regarded as a surrogate for the intractable likelihood estimator with an additive -norm penalty to encourage sparsity of the reconstructed graph. The sparsistency-based estimator offers guarantees for the structure reconstruction, but the result only applies to GMs that satisfy a certain condition that is rather restrictive and hard to verify. It was also proven that the sparsity pattern of the regularized pseudo-likelihood estimator fails to reconstruct the structure of graphs with long-range correlations, even for simple test cases .
Principal tractability of structure reconstruction of an arbitrary Ising model from samples was proven only very recently. Bresler, Mossel and Sly in  suggested an algorithm which reconstructs the graph without errors in polynomial time. They showed that the algorithm requires number of samples that is logarithmic in the number of variables. Although this algorithm is of a polynomial complexity, it relies on an exhaustive neighborhood search, and the degree of the polynomial is equal to the maximal node degree.
Prior to the work reported in this manuscript the best known procedure for perfect reconstruction of an Ising model was through a greedy algorithm proposed by Bresler in . Bresler’s algorithm is based on the observation that the mutual information between neighboring nodes in an Ising model is lower bounded. This observation allows to reconstruct the Ising graph perfectly with only a logarithmic number of samples and in time quasi-quadratic in the number of variables. On the other hand, Bresler’s algorithm suffers from two major practical limitations. First, the number of samples, hence the running time as well, scales double exponentially with respect to the largest node degree and with respect to the largest coupling intensity between pairs of variables. This scaling is rather far from the information-theoretic lower-bound reported in  predicting instead a single exponential dependency on the two aforementioned quantities. Second, Bresler’s algorithm requires prior information on the maximum and minimum coupling intensities as well as on the maximum node degree, guarantees which, in reality, are not necessarily available.
In this paper we propose a novel estimator for the graph structure of an arbitrary Ising model which achieves perfect reconstruction in quasi-quartic time (although we believe it can be provably reduced to quasi-quadratic time) and with a number of samples logarithmic in the system size. The algorithm is near-optimal in the sense that the number of samples required to achieve perfect reconstruction, and the run time, scale exponentially with respect to the maximum node-degree and the maximum coupling intensity, thus matching parametrically the information-theoretic lower bound of . Our statistical estimator has the structure of a consistent M-estimator implemented via convex optimization with an additional thresholding procedure. Moreover it allows intuitive interpretation in terms of what we coin the “interaction screening”. We show that with a proper -regularization our estimator reconstructs couplings of an Ising model from a number of samples that is near-optimal. In addition, our estimator does not rely on prior information on the model characteristics, such as maximum coupling intensity and maximum degree.
The rest of the paper is organized as follows. In Section 2 we give a precise definition of the structure estimation problem for the Ising models and we describe in detail our method for structure reconstruction within the family of Ising models. The main results related to the reconstruction guarantees are provided by Theorem 1 and Theorem 2. In Section 3 we explain the strategy and the sequence of steps that we use to prove our main theorems. Proofs of Theorem 1 and Theorem 2 are summarized at the end of this Section. Section 4 illustrates performance of our reconstruction algorithm via simulations. Here we show on a number of test cases that the sample complexity of the suggested method scales logarithmically with the number of variables and exponentially with the maximum coupling intensity. In Section 5 we discuss possible generalizations of the algorithm and future work.
2 Main Results
Consider a graph with vertexes where is the vertex set and is the undirected edge set. Vertexes are associated with binary random variables that are called spins. Edges are associated with non-zero real parameters that are called couplings. An Ising model is a probability distribution over spin configurations that reads as follows:
where is a normalization factor called the partition function.
Notice that even though the main innovation of this paper – the efficient “interaction screening” estimator – can be constructed for the most general Ising models, we restrict our attention in this paper to the special case of the Ising models with zero local magnetic-field. This simplification is not necessary and is done solely to simplify (generally rather bulky) algebra. Later in the text we will thus refer to the zero magnetic field model (2) simply as the Ising model.
2.1 Structure-Learning of Ising Models
Suppose that sequences/samples of spins are observed. Let us assume that each observed spin configuration is i.i.d. from (1). Based on these measurements/samples we aim to construct an estimator of the edge set that reconstructs the structure exactly with high probability, i.e.
where is a prescribed reconstruction error.
We are interested to learn structures of Ising models in the high-dimensional regime where the number of observations/samples is of the order . A necessary condition on the number of samples is given in [22, Thm. 1]. This condition depends explicitly on the smallest and largest coupling intensity
and on the maximal node degree
where the set of neighbors of a node is denoted by .
According to , in order to reconstruct the structure of the Ising model with minimum coupling intensity , maximum coupling intensity , and maximum degree , the required number of samples should be at least
We see from Eq. (6) that the exponential dependence on the degree and the maximum coupling intensity are both unavoidable. Moreover, when the minimal coupling is small, the number of samples should scale at least as .
Unfortunately, the existence proof presented in  is based on an exhaustive search with the intractable maximum likelihood estimator and thus it does not guarantee actual existence of an algorithm with low computational complexity. Notice also that the number of samples in (7) scales as when and are asymptotically large and as when is asymptotically small.
2.2 Regularized Interaction Screening Estimator
The main contribution of this paper consists in presenting explicitly a structure-learning algorithm that is of low complexity and which is near-optimal with respect to bounds (6) and (7). Our algorithm reconstructs the structure of the Ising model exactly, as stated in Eq. (3), with an error probability , and with a number of samples which is at most proportional to and . (See Theorem 1 and Theorem 2 below for mathematically accurate statements.) Our algorithm consists of two steps. First, we estimate couplings in the vicinity of every node. Then, on the second step, we threshold the estimated couplings that are sufficiently small to zero. Resulting zero coupling means that the corresponding edge is not present.
Denote the set of couplings around node
by the vector. In this, slightly abusive notation, we use the convention that if a coupling is equal to zero it reads as absence of the edge, i.e. if and only if . Note that if the node degree is bounded by , it implies that the vector of couplings is non-zero in at most entries.
Our estimator for couplings around node
is based on the following loss function coined the Interaction Screening Objective (ISO):
The ISO is an empirical weighted-average and its gradient is the vector of weighted pair-correlations involving . At the exponential weight cancels exactly with the corresponding factor in the distribution (1). As a result, weighted pair-correlations involving vanish as if was uncorrelated with any other spins or completely “screened” from them, which explains our choice for the name of the loss function. This remarkable “screening” feature of the ISO suggests the following choice of the Regularized Interaction Screening Estimator (RISE) for the interaction vector around node :
where is a tunable parameter promoting sparsity through the additive -penalty. Notice that the ISO is the empirical average of an exponential function of which implies it is convex. Moreover, addition of the -penalty preserves the convexity of the minimization objective in Eq. (9).
As expected, the performance of RISE does depend on the choice of the penalty parameter . If is too small is too sensitive to statistical fluctuations. On the other hand, if is too large has too much of a bias towards zero. In general, the optimal value of is hard to guess. Luckily, the following theorem provides strong guarantees on the square error for the case when is chosen to be sufficiently large.
Theorem 1 (Square Error of RISE).
Let be realizations of spins drawn i.i.d. from an Ising model with maximum degree and maximum coupling intensity . Then for any node and for any , the square error of the Regularized Interaction Screening Estimator (9) with penalty parameter is bounded with probability at least by
Our structure estimator (for the second step of the algorithm), Structure-RISE, takes RISE output and thresholds couplings whose absolute value is less than to zero:
Performance of the Structure-RISE is fully quantified by the following Theorem.
Theorem 2 (Structure Learning of Ising Models).
Let be realizations of spins drawn i.i.d. from an Ising model with maximum degree , maximum coupling intensity and minimal coupling intensity . Then for any , Structure-RISE with penalty parameter reconstructs the edge-set perfectly with probability
Theorem 1 states that RISE recovers not only the structure but also the correct value of the couplings up to an error based on the available samples. It is possible to improve the square-error bound (10) even further by first, running Structure-RISE to recover edges, and then re-running RISE with for the remaining non-zero couplings.
The computational complexity of RISE is equal to the complexity of minimizing the convex ISO and, as such, it scales at most as . Therefore, computational complexity of Structure-RISE scales at most as simply because one has to call RISE at every node. We believe that this running-time estimate can be proven to be quasi-quadratic when using first-order minimization-techniques, in the spirit of . We have observed through numerical experiments that such techniques implement Structure-RISE with running-time .
Notice that in order to implement RISE there is no need for prior knowledge on the graph parameters. This is a considerable advantage in practical applications where the maximum degree or bounds on couplings are often unknown.
The Regularized Interaction Screening Estimator (9) is from the class of the so-called regularized M-estimators. Negahban et al. proposed in  a framework to analyze the square error of such estimators. As per , enforcing only two conditions on the loss function is sufficient to get a handle on the square error of an -regularized M-estimator.
The first condition links the choice of the penalty parameter to the gradient of the objective function.
The -penalty parameter strongly enforces regularization if it is greater than any partial derivatives of the objective function at , i.e.
Condition 1 guarantees that if the vector of couplings has at most non-zero entries, then the estimation difference lies within the set
The second condition ensure that the objective function is strongly convex in a restricted subset of . Denote the reminder of the first-order Taylor expansion of the objective function by
where is an arbitrary vector. Then the second condition reads as follows.
The objective function is restricted strongly convex with respect to on a ball of radius centered at , if for all such that , there exists a constant such that
3.1 Gradient Concentration
Like the ISO (8), its gradient in any component is an empirical average
where the random variables are i.i.d and they are related to the spin configurations according to
In order to prove that the ISO gradient concentrates we have to state few properties of the support, the mean and the variance of the random variables (19), expressed in the following three Lemmas.
The first of the Lemmas states that at , the random variable has zero mean.
For any Ising model with spins and for all
By direct computation, we find that
where in the last line we use the fact that the exponential terms involving cancel, implying that the sum over is zero. ∎
The second Lemma proves that at , the random variable has a variance equal to one.
For any Ising model with spins and for all
As a result of direct evaluation one derives
Notice that in the second line the first sum over edges (under the exponential) does not depend on . Furthermore, the first sum is invariant under the change of variables, , while the second sum changes sign. This transformation results in appearance of the partition function in the numerator. ∎
The next lemma states that at , the random variable has a bounded support.
For any Ising model with spins, with maximum degree and maximum coupling intensity , it is guaranteed that for all
Observe that components of are smaller than and at most of them are non-zero. Recall that spins are binary, , which results in the following estimate
For any Ising model with spins, with maximum degree and maximum coupling intensity . For any , if the number of observation satisfies , then the following bound holds with probability at least :
Let us first show that every term is individually bounded by the RHS of (26) with high-probability. We further use the union bound to prove that all components are uniformly bounded with high-probability. Utilizing Lemma 1, Lemma 2 and Lemma 3 we apply the Bernstein’s Inequality
Inverting the following relation
and substituting the result in the Eq. (27) one derives
For , we can simplify Eq. (29) to have an expression independent of and
Using and the union bound on every component of the gradient leads to the desired result. ∎
3.2 Restricted Strong-Convexity
The remainder of the first-order Taylor-expansion of the ISO, defined in Eq. (15) is explicitly computed
where the function appearing in Eq. (31) reads
For all , the remainder of the first-order Taylor expansion admits the following lower-bound
where the matrix is an empirical covariance matrix with elements
We start to prove a lower-bound on the function valid for all ,
To see this, define an auxiliary function as follows
We show that achieves its minimum at Observe that the first derivative of vanishes at zero from both the negative and positive side
Moreover for all the second derivative of is non-negative
A similar result holds for
proving that for all , .
Combining Eq. (35) with the straightforward inequalities
leads us to the following lower-bound on the remainder of the first-order Taylor expansion of the ISO
where in the last line we used the trivial identity . ∎
Lemma 5 enables us to control the randomness in through the simpler matrix that is independent of . This last point is crucial as we show in the next lemma that concentrates independently of towards its mean.
Consider an Ising model with spins, with maximum degree and maximum coupling intensity . Let , and . Then with probability greater than , we have for all
where the matrix is the covariance matrix with elements
We recall that the matrix elements of the empirical covariance matrix read
Since using Hoeffding’s inequality, we have
As is symmetric we use the union bound over the elements to get
The last ingredient that we need is a proof that the smallest eigenvalue of the covariance matrixis bounded away from zero independently of the dimension . Equivalently the next lemma shows that the quadratic form associated with is non-degenerate regardless of the value of .
Consider an Ising model with spins, with maximum degree and maximum coupling intensity . For all the following bound holds
Our proof strategy here follows [16, Cor. 3.1]. Notice that the probability measure of the Ising model is symmetric with respect to the sign flip, i.e. . Thus any spin has zero mean, which implies that for every
This allows to reinterpret the left-hand side of Eq. (48) as a variance, using that
Construct a subset recursively as follows: (i) let and define , (ii) given , let and and set , (iii) terminate when and declare .
The set possesses the following two main properties. First, every node does not have any neighbors in and, second,
We apply the law of total variance to (50) by conditioning on the set of spins with indexes belonging to the complementary set ,
where in the last line one uses that the spins in are conditionally independent given their neighbors . One concludes the proof by using relation (51) and the fact that the conditional variance of a spin given its neighbors is bounded from below:
We stress that Lemma 7 is a deterministic result valid for all . We are now in position to prove the restricted strong convexity of the ISO.
Consider an Ising model with spins, with maximum degree and maximum coupling intensity . For all and , when the ISO (8) satisfies, with probability at least , the restricted strong convexity condition
for all such that and .
3.3 Proof of the main Theorems
Proof of Theorem 1 (Square Error of RISE).
We seek to apply Proposition 1 to the Regularized Interaction Screening Estimator (9). Using in Lemma 4 and letting , it follows that Condition 1 is satisfied with probability greater than , whenever .
Proof of Theorem 2 (Structure Learning of Ising Models).
According to Theorem 1, one observes that, with probability , the minimal amount of samples required to achieve an error of on every coupling around a single node is
Let us choose and use the union-bound to ensure that the couplings at every node (thresholded by ) are simultaneously recovered with probability greater than . ∎
4 Numerical Results
We test performance of the Struct-RISE, with the strength of the -regularization parametrized by , on Ising models over two-dimensional grid with periodic boundary conditions (thus degree of every node in the graph is ). We have observed that this topology is one of the hardest for the reconstruction problem. We are interested to find the minimal number of samples, , such that the graph is perfectly reconstructed with probability . In our numerical experiments, we recover the value of as the minimal for which Struct-RISE outputs the perfect structure 45 times from 45 different trials with samples, thus guaranteeing that the probability of perfect reconstruction is greater than with a statistical confidence of at least .
We first verify the logarithmic scaling of with respect to the number of spins . The couplings are chosen uniform and positive . This choice ensures that samples generated by Glauber dynamics are i.i.d. according to (1). Values of for are shown on the left in Figure 1. Empirical scaling is, , which is orders of magnitude better than the rather conservative prediction of the theory for this model, .
We also test the exponential scaling of with respect to the maximum coupling intensity . The test is conducted over two different settings both with spins: the ferromagnetic case where all couplings are uniform and positive, and the spin glass case where the sign of couplings is assigned uniformly at random. In both cases the absolute value of the couplings, , is uniform and equal to . To ensure that the samples are i.i.d, we sample directly from the exhaustive weighted list of the possible spin configurations. The structure is recovered by thresholding the reconstructed couplings at the value .
Experimental values of for different values of the maximum coupling intensity, , are shown on the right in Fig. 1. Empirically observed exponential dependence on is matched best by, , in the ferromagnetic case and by, , in the case of the spin glass. Theoretical bound for predicts . We observe that the difference in sample complexity depends significantly on the type of interaction. An interesting observation one can make based on these experiments is that the case which is harder from the sample-generating perspective is easier for learning and vice versa.
5 Conclusions and Path Forward
In this paper we construct and analyze the Regularized Interaction Screening Estimator (9). We show that the estimator is computationally efficient and needs an optimal number of samples for learning Ising models. The RISE estimator does not require any prior knowledge about the model parameters for implementation and it is based on the minimization of the loss function (8), that we call the Interaction Screening Objective. The ISO is an empirical average (over samples) of an objective designed to screen an individual spin/variable from its factor-graph neighbors.
Even though we focus in this paper solely on learning pair-wise binary models, the “interaction screening” approach we introduce here is generic. The approach extends to learning other Graphical Models, including those over higher (discrete, continuous or mixed) alphabets and involving high-order (beyond pair-wise) interactions. These generalizations are built around the same basic idea pioneered in this paper – the interaction screening objective is (a) minimized over candidate GM parameters at the actual values of the parameters we aim to learn; and (b) it is an empirical average over samples. In the future, we plan to explore further theoretical and experimental power, characteristics and performance of the generalized screening estimator.
We are thankful to Guy Bresler and Andrea Montanari for valuable discussions, comments and insights. The work was supported by funding from the U.S. Department of Energy’s Office of Electricity as part of the DOE Grid Modernization Initiative.
-  D. Marbach, J. C. Costello, R. Kuffner, N. M. Vega, R. J. Prill, D. M. Camacho, K. R. Allison, M. Kellis, J. J. Collins, and G. Stolovitzky, “Wisdom of crowds for robust gene network inference,” Nat Meth, vol. 9, pp. 796–804, Aug 2012.
-  F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, “Direct-coupling analysis of residue coevolution captures native contacts across many protein families,” Proceedings of the National Academy of Sciences, vol. 108, no. 49, pp. E1293–E1301, 2011.
-  E. Schneidman, M. J. Berry, R. Segev, and W. Bialek, “Weak pairwise correlations imply strongly correlated network states in a neural population,” Nature, vol. 440, pp. 1007–1012, Apr 2006.
-  S. Roth and M. J. Black, “Fields of experts: a framework for learning image priors,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 2, pp. 860–867 vol. 2, June 2005.
-  N. Eagle, A. S. Pentland, and D. Lazer, “Inferring friendship network structure by using mobile phone data,” Proceedings of the National Academy of Sciences, vol. 106, no. 36, pp. 15274–15278, 2009.
-  M. He and J. Zhang, “A dependency graph approach for fault detection and localization towards secure smart grid,” IEEE Transactions on Smart Grid, vol. 2, pp. 342–351, June 2011.
-  D. Deka, S. Backhaus, and M. Chertkov, “Structure learning and statistical estimation in distribution networks,” submitted to IEEE Control of Networks; arXiv:1501.04131; arXiv:1502.07820, 2015.
-  C. Chow and C. Liu, “Approximating discrete probability distributions with dependence trees,” IEEE Transactions on Information Theory, vol. 14, pp. 462–467, May 1968.
-  A. d’Aspremont, O. Banerjee, and L. E. Ghaoui, “First-order methods for sparse covariance selection,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 1, pp. 56–66, 2008.
-  J. K. Johnson, D. Oyen, M. Chertkov, and P. Netrapalli, “Learning planar ising models,” Journal of Machine Learning, in press; arXiv:1502.00916, 2015.
T. Tanaka, “Mean-field theory of Boltzmann machine learning,”Phys. Rev. E, vol. 58, pp. 2302–2310, Aug 1998.
-  H. J. Kappen and F. d. B. Rodríguez, “Efficient learning in Boltzmann machines using linear response theory,” Neural Computation, vol. 10, no. 5, pp. 1137–1156, 1998.
-  Y. Roudi, J. Tyrcha, and J. Hertz, “Ising model for neural data: Model quality and approximate methods for extracting functional connectivity,” Phys. Rev. E, vol. 79, p. 051915, May 2009.
-  F. Ricci-Tersenghi, “The Bethe approximation for solving the inverse Ising problem: a comparison with other inference methods,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2012, no. 08, p. P08015, 2012.
-  G. Bresler, D. Gamarnik, and D. Shah, “Hardness of parameter estimation in graphical models,” in Advances in Neural Information Processing Systems 27 (Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds.), pp. 1062–1070, Curran Associates, Inc., 2014.
-  A. Montanari, “Computational implications of reducing data to sufficient statistics,” Electron. J. Statist., vol. 9, no. 2, pp. 2370–2390, 2015.
P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, “High-dimensional Ising
model selection using
1-regularized logistic regression,”Ann. Statist., vol. 38, pp. 1287–1319, 06 2010.
-  S. N. Negahban, P. Ravikumar, M. J. Wainwright, and B. Yu, “A unified framework for high-dimensional analysis of -estimators with decomposable regularizers,” Statist. Sci., vol. 27, pp. 538–557, 11 2012.
-  A. Montanari and J. A. Pereira, “Which graphical models are difficult to learn?,” in Advances in Neural Information Processing Systems 22 (Y. Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta, eds.), pp. 1303–1311, Curran Associates, Inc., 2009.
-  G. Bresler, E. Mossel, and A. Sly, “Reconstruction of Markov random fields from samples: Some observations and algorithms,” SIAM Journal on Computing, vol. 42, no. 2, pp. 563–578, 2013.
G. Bresler, “Efficiently learning Ising models on arbitrary graphs,” in
Proceedings of the Forty-Seventh Annual ACM on Symposium on Theory of Computing, pp. 771–782, ACM, 2015.
-  N. P. Santhanam and M. J. Wainwright, “Information-theoretic limits of selecting binary graphical models in high dimensions,” IEEE Transactions on Information Theory, vol. 58, pp. 4117–4134, July 2012.
A. Agarwal, S. Negahban, and M. J. Wainwright, “Fast global convergence of gradient methods for high-dimensional statistical recovery,”Ann. Statist., vol. 40, pp. 2452–2482, 10 2012.