Probabilistic graphical models are intriguing because they are employed for discriminative problems and generative tasks . However, large-scale graphical models incur significant computational costs for both inference and learning. Notable ways to overcome this difficulty are statistical sampling and variational methods. However, such methods typically require iterative computations, and the number of iterations to obtain reasonably accurate results is not known in advance. In this paper, we propose a learning formula for a limited class of the probabilistic graphical models, i.e. Ising models. The proposed formula enables us to determine the optimal model parameters for the approximate objective function without iterations.
Some non-iterative methods for inverse Ising problems have been developed. For example, Bethe free energy, which is the objective function of a belief propagation algorithm, has been analyzed in the Ising model . Together with the linear response relation , an analytic expression of the correlation matrix has been obtained as a function of the model parameter of the interaction matrix. Surprisingly, it was later found that the analytic expression for the correlations can be solved inversely for the interaction matrix [9, 10]. This analytic inverse solution enables us to obtain the optimal interaction matrix in Bethe approximation without iterations.
In this study, instead of the Bethe approximation, we use the tree-reweighted (TRW) approximation to obtain an approximate solution for the inverse Ising problem. The TRW approximation was developed to improve the accuracy and convergence of a belief propagation algorithm . Differing from the ordinary belief propagation algorithm, TRW free energy is provably convex with respect to the variational parameters. Moreover, the partition function computed using the TRW approximation gives a rigorous upper bound of the exact partition function. Thus, TRW approximation gives a lower bound of the exact log-likelihood function when learning a model . Although the use of this lower bound for learning has been proposed in the previous study, it has not yet been applied to derive an inverse formula in Ising model.
The reminder of this paper is organized as follows. In Section II, we introduce the Ising model and its direct and inverse problems. In Section III, we briefly summarize the theorem and properties of TRW approximation. Using TRW free energy, we analytically obtain the inverse formula for the Ising model that optimizes the rigorous lower bound of the exact log-likelihood in Section IV. In Section V, we review related work and introduce previously derived inverse formulae. In Section VI, we describe experiments conducted to compare the proposed inverse formula and previous formulae. Conclusions and suggestions for future work are provided in Section VII.
Ii Ising model and Inverse Problem
Ising models are undirected probabilistic graphical models that describe pairwise interactions between binary spin variables . Here, let
be a set of binary variablesand be a set of edges . The energy function of an Ising model is defined by
are parameters that represent interaction strengths and biases (local external fields), respectively. Given the energy function, the probability distribution for the model is defined by the Boltzmann distribution
where the partition function
is defined as a function of the parameters , and the log-partition function .
In practical applications of Ising models, we are confronted by two difficult problems, i.e., the inference and learning problems. The former is referred to as a direct problem, and the latter is an inverse problem of the Ising model.
The direct problem is formulated as follows. With fixed model parameter , one wants to compute expectation values, such as and . Computation of these expectation values is generally intractable, because it includes summations over numerous combinations of spin variables. Note that this computational difficulty is equivalent to that of the partition function: If we know the function (or ), we can readily obtain the expectation values.
In the inverse problem, differing from the direct problem, the goal is to infer the optimal parameter when a dataset is given, where the optimal parameter maximizes the log-likelihood function:
Here, denotes an expectation value with respect to the given dataset .
Similar to the direct problem, the partition function plays an important role in the inverse problem. Because the log-likelihood function is concave with respect to (Section IV), the optimal parameters that maximize can be obtained by solving , i.e.,
These equations yield the well-known moment-matching conditions for exact maximum likelihood estimates .
Note that this reformulation of the inverse problem is an identity transformation of the problem, and solving Eqs. (5) and (6) for parameter remains infeasible because of the summation over all spin states in Eq.(3).
In the next section, we compute the moments in the model using approximation. Then, we solve the moment matching conditions for the interaction parameter , which yields the inverse formula as a function of the data statistics.
Iii Tree-reweighted Approximation
The TRW approximation  provides a systematic way to construct a rigorous upper bound of the partition function. Using this upper bound, approximate pseudo moments that enable us to solve the moment-matching conditions (5) and (6) approximately can be obtained.
Here, to introduce the TRW approximation, we first define a spanning tree of a given graph and a probability measure over a set of spanning trees. Then, we define TRW free energy and provide the main theorem of TRW approximation.
Iii-a Spanning trees
A spanning tree of a given graphical model is defined by a tree graph in which any two vertices in are connected via edges in and other vertices. Let be the set of all spanning trees of . Over the spanning trees in , we assign an arbitrary probability distribution holding non-negativity and normalization, i.e., and .
Next, we define a graphical model on each spanning tree . Using the indicator function ,
we define the connectivity matrix of the model with spanning tree as . By defining the parameter of the model with the spanning tree as , we can write the log partition function with the spanning tree as .
Finally, using the probability distribution , we define the edge appearance probabilities for an edge as follows:
Intuitively, represents a probability of the existence of the edge in a tree if is selected according to the probability .
Iii-B Tree-reweighted free energy
Relative to the development of the TRW approximation, we identify the convexity of the log partition function with respect to parameters . This is verified easily by computing the first and second derivatives, for example, with respect to ,
The second derivative is the covariance matrix of and . Here, the covariance matrix is positive semidefinite, which implies the convexity of .
Using the convexity of , the upper bound of can be obtained by applying Jensen’s inequality as follows:
where we used the fact that . A previous study  has demonstrated that the upper bound can be obtained as a solution of the variational problem that is defined by the TRW free energy.
In this variational problem, variational parameters are the pseudomarginals that satisfy
for all and .
When a valid edge appearance probability is given, the TRW free energy is constructed by the energy and entropy terms as
where the energy term and the entropy term are defined as
respectively. Here, we define the entropies of the pseudomarginals as and .
The following theorem is proven  using the TRW free energy.
For a given , the upper bound of Eq.(11) is obtained by the solution of the minimization problem:
Moreover, is convex with respect to : thus the optimal solution to this problem is unique.
Iv Approximate Solution to the Inverse Ising Problem
Iv-a Rigorous lower bound of the objective
In the TRW approximation, the exact objective function for learning (4) is approximated by
Note that, with reference to Theorem1, this approximated objective is a rigorous lower bound for the exact objective function:
Consequently, maximizing results in maximization of the lower bound of the exact objective function .
Since is a convex function, is also convex with respect to . This implies that is concave with respect to because the averaged energy term is linear in . Thus, the optimal solution is obtained by solving , which results in the pseudo-moment matching conditions,
Iv-B Analytic Solution in the TRW Approximation
Next, we give the explicit solution of the pseudo-moment matching conditions(21) and (22). Given that all random variables are binary, we parameterize pseudomarginals by mean value and covariance as follows:
By substituting these equations into the definition of TRW free energy (15), the optimal solutions are obtained by solving .
By applying cavity methods , we derive a self-consistency equation for by transforming the equations and . Following the derivation in Bethe approximation  , we obtain the self-consistency equation of TRW approximation as:
Now, consider the pseudo-moment matching conditions. Eq. (21) simply states that the mean value of the model should be matched to that computed by data, i.e., . We rewrite for simplicity. Note that this should not be confused with .
where we have defined the covariance of data . This equation states that, to include the information about the covariance of the data, we must know the derivative of the fixed-point solution .
Taking the derivative to the both sides of Eq.(25), we obtain the following.
Since the linear response relation (28) states that:
we obtain the following from Eq.(29), for any ,
Finally, replacing and solving this equation for , we obtain an approximate solution to the inverse Ising problem:
V Related Work
V-a Inference and learning in TRW approximation
The TRW approximation was introduced  to improve belief propagation algorithm for general graphs with loops. This belief propagation algorithm, which was first used for the exact inference method for the models without loops , was later applied to approximate inference of models with loops . Although there are many examples where belief propagation gives reasonable results, it has been found that the belief propagation algorithm cannot converge in some cases. Theoretically, it has been pointed out that Bethe free energy, which is the objective function of belief propagation , is not always convex for general models , which explains the divergence and cyclic behavior of the belief propagation iteration. On the other hand, the TRW free energy is convex with respect to the variational parameters. This fact improves the convergence of the iteration. Another advantage is that the TRW approximation can give the rigorous upper bound of the exact partition function.
An early attempt to use TRW approximation for learning graphical models can be found in the literature . The authors of  pointed out that TRW approximation gives a rigorous lower bound for the exact objective function (Eq.(20)), and that maximization of the TRW objective function can be achieved by using the pseudo-moment matching conditions (Eqs.(21) and (22)). In contrast to our approach, they used a variant of the belief propagation algorithm to compute the pseudo moments. Although this approach is widely applicable to various models other than Ising models, iterative computation is required until the belief propagation algorithm converges. Furthermore, convergence of the belief propagation algorithm is not guaranteed in general. In this paper, we avoid this difficulty by directly calculating the analytic expression for the pseudo moments that minimize the TRW free energy.
V-B Existing inverse formulae
Various inverse formulae for the inverse Ising problem have been developed, and some of them are presented in this subsection because they are compared to the proposed inverse formula in the next section.
In the independent-pair (IP) approximation , only effects of a pair of spins and are taken into account in computation of the interaction , which leads to the formula:
Note that the IP approximation corresponds to the Bethe approximation without using the linear response relation.
Another approach is to use the series of the mean-field approximations . The most advanced approximation used to derive the inverse formula is Bethe approximation . The result is given by setting in Eq. (32),
because the TRW free energy is reduced to Bethe free energy by setting . Note that this does not mean that the TRW approximation includes the Bethe approximation: The choice of is invalid unless the graph is a tree.
In , another inverse formula is developed using small-correlation expansion of the entropy function. After resumming over the relevant diagrams, the Sessak-Monasson (SM) approximation is obtained as follows:
Vi Numerical Results
In subsection VI-A, using the Ising models with the artificial parameters, we reconstruct the parameters by the inverse formulae. We compare the accuracy of the methods by measuring the reconstruction errors.
In subsection VI-B, using the real-world neural spike train data, we estimate effective synaptic connections between the recorded neurons as an interaction matrix of an Ising model . Because we do not know the exact interactions matrices, we evaluate the accuracy of the inverse formulae by comparing the inferred interaction matrices with that computed by the gradient ascent algorithm.
Vi-a Artificial data
To evaluate the accuracy of the proposed formula, we attempted to reconstruct an interaction matrix from the statistics of sampled data. The experimental method is described as follows: We built Ising models with fixed graph structures and parameters. Using the Monte Carlo sampling method, we computed the mean value and covariance in the models. Finally, by substituting and to the inverse formulae, we reconstructed the interaction matrices . For the Monte Carlo simulation, we used the PyMC3 package .
Here, we used three types of graph structures, i.e., two-dimensional grid, three-dimensional grid, and fully-connected graphs. The numbers of random variables were set as ,
, and 16 for the two-dimensional grid, three-dimensional grid, and fully-connected graph, respectively. For each graph, we set two types of interaction matrices for attractive and mixed interactions. For both interactions, we used uniform distributionwith given interaction strength for all pairs as
The bias parameter was set by for both attractive and mixed settings.
For the inverse formulae, we compared the IP approximation (34), Bethe approximation (35), SM approximation (36), and TRW approximation (32). For TRW approximation, we set , which corresponds to assigning the probability to the edges uniformly (see the discussion in the conclusion). To measure the error in the reconstructed interaction, we compared it to the true interaction using the normalized distance:
The smaller becomes, the better the reconstruction is.
The reconstruction results for the two-dimensional grid, three-dimensional grid, and fully-connected graph are shown in Fig. 1. For all graph structures, the error measurement becomes large in small interaction strength regions irrespective of the inverse formulae because, in such small regions, the statistical uncertainty of the mean and covariance primarily dominates the errors .
When is apart from zero, the differences between the reconstruction errors become large. For all settings, the IP approximation typically gave the worst results. For the attractive interactions, the proposed formula based on TRW approximation gave the best accuracy. For the fully-connected graph, in which reconstruction is the most severe (note the scales in Fig. 1), the proposed formula demonstrated reasonable errors around .
However, for the mixed interactions, Bethe approximation gave the best results. Here, TRW approximation was comparable to the Bethe and SM approximations for grid graphs, and TRW approximation was better than SM approximation in the three-dimensional graph around . For the fully-connected model, TRW approximation gave a worse result, which is close to that of IP approximation.
Vi-B Real-world neural data
In the next experiment, we evaluated the inverse formulae with a task to infer latent effective networks of neurons , using real-world neural spike train data. Neural spike train data are composed by multidimensional series of spiking times. These data are converted to a set of spin states, to witch we attempted to fit the Ising model (2) by optimizing parameters and . The inferred interaction matrix represents effective synaptic connectivity between neurons and , i.e., if is positive, the connection between them is excitatory, and if is negative, the connection is inhibitory. The absolute value of represents strength of the connection.
In this task, we used neural spike train data , from CRCNS.org (ret-1) . The data were collected by multi-electrode array recordings of the retinal ganglion cells of mice in various conditions. From the whole data, we selected three recordings (in 20080516_R1.mat), because in these recordings, the number of recorded neurons, seven, was large enough not to give a trivial inverse problem, and not too large for applying the inverse formulae to obtain reasonable results (see the number of spins in the experiment in Sec. VI. A).
The method of this experiment was as follows : First, we converted the spike train data to a set of spin states. By giving bin size , we assigned a spin state to the -th neuron in the -th bin. Here, if there were one or more spikes, we set , otherwise . We used a bin size ms. Next, using the obtained spin states, we computed the statistics and . Finally, with these statistics, we inferred the interaction matrix of a fully-connected model using the inverse formulae.
For this experiment, we compared the IP approximation, Bethe approximation, SM approximation, and TRW approximation for the inverse formulae. For TRW approximation, we set .
To evaluate the estimates of the inverse formulae, we compared the inferred interaction matrices to what was obtained by the gradient ascent algorithm, because we did not know the exact interaction matrix, unlike the artificial models in the previous experiment. In each update of the gradient ascent algorithm, we computed the gradients and by running the Monte Carlo algorithm in one hundred steps using the PyMC3 package. We updated the parameters ten thousand times with the learning rate . It took about ninety minutes to compute one interaction matrix.
We measured the distances between the interaction matrices computed by the inverse formulae and by the gradient ascent algorithm measured by Eq.(38) with replaced by the gradient ascent result.
The distances between the estimates obtained by the inverse formulae and the gradient ascent algorithm measured by Eq. (38) are shown in Table I. As can be seen, for all three datasets, the developed TRW approximation gave the smallest distance from the gradient ascent results. Somewhat surprisingly, SM approximation always gives the worst result. The hierarchy of the accuracy of the approximations was consistent with the previous reconstruction experiments in the fully-connected graph with the attractive interaction ( Fig. 1 (e)). Let us remark that it took less than 10 ms to compute an interaction matrix using the inverse formulae.
Examples of the resulting interaction matrices of the IP, Bethe, and TRW approximations and of the gradient ascent algorithm are visualized in Fig. 2. Note that most elements of the matrix are positive in the gradient ascent result. Therefore, the problem can be considered similar to the attractive case in the fully-connected graph (Fig. 1(e)), as suggested by the hierarchy of ’s in Table I. Compared to the gradient ascent result, the IP and Bethe approximations tended to overestimate the interactions. Furthermore, the IP approximation could not reproduce the negative element in the interaction between the fourth and fifth neurons. While both Bethe and TRW approximations could reproduce the negative elements, TRW approximation gave a result that was the most similar to that of the gradient ascent as measured by Eq. (38).
Aiming at faster and more accurate learning, we have developed a new, iteration-free formula for the inverse Ising problem. Following a previous study , we combined the linear response relation and the pseudo-moment matching conditions to derive the inverse formula. A remarkable difference from that is that we used tree-reweighted free energy rather than Bethe free energy. An advantage of using tree-reweighted free energy is that we can optimize the rigorous lower bound of the exact objective function. We analytically obtained the resulting inverse formula (32), which gives the interaction matrix as the function of the edge appearance probability as well as the statistics of the input dataset and . Using this formula, we can compute the approximate interaction matrix in the same computational complexity as in Bethe approximation (35).
We compared the proposed inverse formula to other formulae in interaction reconstruction experiments with various graph structures and interaction matrices. We found that the proposed formula gave the best accuracy in models with attractive interaction matrices (i.e. the elements of the interaction matrices were positive). In particular, for fully-connected graphs, the proposed formula showed overwhelmingly good reconstructions compared to Bethe and Sessak-Monnason approximations. In contrast, for mixed interaction matrices(i.e. the elements were both positive and negative), the best approximation was obtained by Bethe approximation in most cases: however, the proposed formula gave comparable results in models with grid graphs.
We also applied our formula to real-world neural spike train data. In a task to infer latent effective networks, our formula gave a result most similar to that obtained by the gradient ascent algorithm in a Monte Carlo simulation. Note that the proposed inverse formula does not require iterative computations (except for matrix inversion), while the gradient ascent algorithm requires thousands of iterations.
Although we have demonstrated that the inverse formula we derived in tree-reweighted approximation is useful, some open questions should be solved for future improvement and practical applications. First, we have the free parameter which was fixed to uniform values in our experiments. There is no doubt that optimizing will improve the accuracy of the inverse formula. In fact, an optimal value that minimizes the upper bound has been discussed previously . However, it is difficult to obtain an optimal value by solving equations analytically, and we must perform iterative computations. Even though it is difficult to obtain the optimal value analytically, there may be an easy choice of that is superior to the uniform choice used in this study.
Another question is the extension of the inverse formula to models with hidden variables, e.g. restricted Boltzmann machines. The introduction of hidden variables makes the model drastically simpler and recognizable to a human being. We may extend the inverse formula directly to include hidden variables, or we may use the inverse formula in a step of expectation-maximization-like algorithms to reduce computational cost.
Finally, we are interested in applying tree-reweighted approximation and the proposed inverse formula to physics. Note that the partition function dominates the physical properties of a system, such as phase transitions and critical phenomena: thus, the tree-reweighted approximation, which can give the rigorous bound of the partition function, may play an important role in mathematical analysis of physical models. The proposed formulation, with which we analyzed the exact solution of the tree-reweighted free energy, may also be useful to give new insights into statistical and mathematical physics.
The author is grateful to Yuuji Ichisugi and anonymous reviewers for comments.
This paper is based on results obtained from a project commissioned by the New Energy and Industrial Technology Development Organization (NEDO).
-  D. Koller and N. Friedman, Probabilistic graphical models: principles and techniques. MIT press, 2009.
-  M. Mézard, G. Parisi, and M. Virasoro, Spin glass theory and beyond: An Introduction to the Replica Method and Its Applications. World Scientific Publishing, 1987.
-  E. Schneidman, M. J. Berry, R. Segev, and W. Bialek, “Weak pairwise correlations imply strongly correlated network states in a neural population,” Nature, vol. 440, no. 7087, pp. 1007–1012, 2006.
-  M. Weigt, R. A. White, H. Szurmant, J. A. Hoch, and T. Hwa, “Identification of direct residue contacts in protein-protein interaction by message passing,” Proceedings of the National Academy of Sciences, vol. 106, pp. 67–72, 2009.
-  M. Bailly-Bechet, A. Braunstein, A. Pagnani, M. Weigt, and R. Zecchina, “Inference of sparse combinatorial-control networks from gene-expression data: a message passing approach,” BMC Bioinformatics, vol. 11, p. 355, 2010.
H. C. Nguyen, R. Zecchina, and J. Berg, “Inverse statistical problems: from the inverse Ising problem to data science,”Advances in Physics, vol. 66, no. 3, pp. 197–261, 2017.
-  M. Welling and Y. Teh, “Approximate inference in Boltzmann machines,” Artificial Intelligence, vol. 143, pp. 19–50, 2003.
-  H. J. Kappen and F. B. Rodriguez, “Efficient learning in Boltzmann machines using linear response theory,” Neural Computation, vol. 10, no. 5, pp. 1137–1156, 1998.
-  F. Ricci-Tersenghi, “The Bethe approximation for solving the inverse Ising problem: a comparison with other inference methods,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2012, no. 8, 2012.
-  H. C. Nguyen and J. Berg, “Bethe-Peierls approximation and the inverse Ising problem,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2012, no. 03, 2012.
-  M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, “A new class of upper bounds on the log partition function,” in Uncertainty in Artificial Intelligence, 2002.
-  M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky, “Tree-reweighted belief propagation algorithms and approximate ML estimation by pseudo-moment matching,” in Workshop on Artificial Intelligence and Statistics, 2003.
-  S.-I. Amari, “Differential Geometry of Curved Exponential Families – Curvatures and Information Loss,” The Annals of Statistics, vol. 10, no. 2, pp. 357–385, 1982.
-  M. Mézard and G. Parisi, “The Bethe lattice spin glass revisited,” The European Physical Journal B, vol. 20, p. 217, 2001.
-  M. Mézard and G. Parisi, “The Cavity Method at Zero Temperature,” Journal of Statistical Physics, vol. 111, no. 1, 2003.
-  J. Pearl, Probabilistic reasoning in intelligent systems: networks of plausible inference, 2nd ed. Morgan Kaufmann, 1988.
-  K. Murphy, Y. Weiss, and M. Jordan, “Loopy-belief Propagation for Approximate Inference: An Empirical Study,” in Uncertainty in Artificial Intelligence, 1999.
-  J. S. Yedidia, W. T. Freeman, and Y. Weiss, “Generalized Belief Propagation,” in Advances in neural information processing systems., 2000.
-  T. Heskes, “Stable fixed points of loopy belief propagation are minima of the Bethe free energy,” in Advances in neural information processing systems, 2003.
-  Y. Roudi, J. Tyrcha, and J. Hertz, “Ising model for neural data: Model quality and approximate methods for extracting functional connectivity,” Physical Review E, vol. 79, no. 051915, pp. 1–12, 2009.
T. Tanaka, “Mean-field theory of Boltzmann machine learning,”Physical Review E, vol. 58, no. 2, pp. 2302–2310, 1998.
-  V. Sessak and R. Monasson, “Small-correlation expansions for the inverse Ising problem,” Journal of Physics A: Mathematical and Theoretical, vol. 42, no. 5, p. 055001, 2009.
-  J. Salvatier, T. V. Wiecki, and C. Fonnesbeck, “Probabilistic programming in Python using PyMC3,” PeerJ Computer Science, vol. 2, p. e55, 2016.
-  E. Marinari and V. Van Kerrebroeck, “Intrinsic limitations of the susceptibility propagation inverse inference for the mean field Ising spin glass,” Journal of Statistical Mechanics: Theory and Experiment, vol. 2010, no. 2, pp. 1–13, 2010.
-  J. L. Lefebvre, Y. Zhang, M. Meister, X. Wang, and J. R. Sanes, “Gamma-Protocadherins regulate neuronal survival but are dispensable for circuit formation in retina,” Development, vol. 135, pp. 4141–4151, 2008.
-  Y.-F. Zhang, H. Asari, and M. Meister. (2014) Multi-electrode recordings from retinal ganglion cells. CRCNS.org. [Online]. Available: http://dx.doi.org/10.6080/K0RF5RZT