
Intervention in undirected Ising graphs and the partition function

Undirected graphical models have many applications in areas such as machine learning, image processing, and, recently, psychology. Psychopathology in particular has received a lot of attention, where symptoms of disorders are assumed to influence each other. One of the practically most relevant questions is on which symptom (node) to intervene to have the most impact. Intervention in undirected graphical models is equivalent to conditioning, and so the machinery of the Ising model is available to determine the best strategy to intervene. Such calculations require the partition function, which is computationally difficult to obtain. Here we use a Curie-Weiss approach to approximate the partition function in applications of interventions. We show that when the connection weights in the graph are equal within each clique, we obtain exactly the correct partition function. And if the weights vary according to a sub-Gaussian distribution, then the approximation is exponentially close to the correct one. We confirm these results with simulations.


1 Introduction

Graphical models are popular in many applications such as machine learning, image processing, social science, and, recently, psychology. One of the earlier applications was in expert systems, where the objective was to determine the probability of a correct diagnosis given a specific configuration of symptoms and screenings (Cowell et al., 1999). Such applications are also extremely relevant to psychology. As in expert systems, the effect of interventions (medication or therapy) is of central interest. Lauritzen and Richardson (2002) showed that intervention by replacement (hard intervention) in undirected graphs is equivalent to conditioning, unlike intervention in directed acyclic graphs. As a consequence, no special treatment is required to determine (marginal) probabilities for interventions in undirected graphs.

Specifically, for the Ising model, where binary nodes are modelled by their values and their interactions with neighbouring (i.e., connected) nodes, determining the probability of an intervention means that we simply fix a variable to a specific value (0 or 1) and then determine the probabilities as we would in the conditional distribution. We do, however, require the partition function (normalising constant), which boils down to marginalising with specific values for the conditioning variables. For the Ising model this marginalisation is computationally intensive, since the number of configurations to consider grows exponentially with the number of nodes in the subset. Approximations to the partition function can be used to obtain approximate probabilities. One approach is to ignore the interactions altogether, which simplifies the partition function to a product of the partition functions of each node separately. Another approach is to obtain upper and lower bounds on the partition function, such as the Bethe lattice (Wainwright and Jordan, 2008) or the related version for locally tree-like graphs (Dembo et al., 2013). Here we use a different approach and consider the fact that each clique is in itself a Curie-Weiss model (a fully connected graph), for which the partition function can be determined in time linear in the clique size, so that the overall cost is governed by the number of cliques and the size of the largest clique. In a Curie-Weiss model the edge weights are considered equal, which is obviously inappropriate in many situations. We therefore show that when the variation in edge weights is limited to sub-Gaussian random variables, the approximation is exponentially close to the exact partition function, with the error controlled by the size of the clique.

We first discuss undirected graphical models in Section 2. Next, in Section 3, we discuss how interventions can be defined on undirected graphical models, and in Section 4 how such conditioning is implemented in Ising models. The problem here is the normalising constant (partition function), which prohibits direct calculation of probabilities. In Section 5 we discuss possible solutions and present our approach based on the Curie-Weiss model. In Section 6 we perform several simulations to illustrate the size of the errors in the normalising constant obtained with the Curie-Weiss model. Proofs can be found in the Appendix.

2 Undirected graphical models

An undirected graphical model or Markov random field is a set of probability distributions representing the structure of some graph G. There are two equivalent ways of defining a Markov random field: (i) in terms of Markov properties and (ii) in terms of the factorization property.

Let G = (V, E) be an undirected graph, where V is the set of nodes, with |V| = n, and E ⊆ V × V is the set of edges. A subset of nodes S is a cutset or separator set of the graph if removing S results in two (or more) components; that is, S is a cutset if any path between two nodes in different components must go through some node in S. A clique C is a subset of nodes in V such that all nodes in C are connected, that is, for any two nodes i, j ∈ C it holds that (i, j) ∈ E. A maximal clique is a clique C such that including any other node in C would no longer yield a clique.

For an undirected graph G we associate with each vertex i ∈ V a random variable X_i. For any subset of nodes A ⊆ V we define the configuration X_A as the vector of values of the nodes in A; a configuration for all of V is X_V. The edge set restricted to the edges among a subset A is denoted by E(A).

Two variables X_i and X_j are independent if their joint distribution is the product of their marginal distributions, and we write this as X_i ⊥ X_j. The variables X_i and X_j are conditionally independent given X_k if their joint distribution conditional on X_k factorises into the two conditional marginals, written X_i ⊥ X_j | X_k. For subsets of nodes A, B, and S, we denote by X_A ⊥ X_B | X_S that X_A is conditionally independent of X_B given X_S. A random vector X_V is Markov compatible, or Markov with respect to G, if X_A ⊥ X_B | X_S whenever S is a cutset that separates the two disjoint subsets A and B. For strictly positive distributions the Hammersley-Clifford theorem says that the Markov property is equivalent to the factorisation property (Cowell et al., 1999; Lauritzen, 1996). The distribution of the random vector X_V is said to factorise according to graph G if it can be represented by a product of compatibility functions (not necessarily probabilities in general) over the cliques

P(X_V = x) = (1/Z) ∏_C ψ_C(x_C),     (1)

where the product runs over the cliques C of G, the ψ_C are compatibility functions for the cliques, and Z is the normalising constant. This factorisation is convenient since it implies that the effects of conditioning can be evaluated for each clique separately.

One of the most well-known binary undirected graphical models is the Ising model, known from statistical physics as a model of magnetism (see, e.g., Kindermann et al., 1980; Cipra, 1987; Kolaczyk, 2009). The Ising model considers cliques of sizes one and two nodes only, so the interactions are at most pairwise (Wainwright and Jordan, 2008; Besag, 1974). Let the parameter vector contain the threshold parameters μ_i for the nodes and the interaction parameters σ_ij for the edges. The distribution of the Ising model for a binary configuration x ∈ {0, 1}^n can then be written as

P(X_V = x) = exp( Σ_{i ∈ V} μ_i x_i + Σ_{(i,j) ∈ E} σ_ij x_i x_j − A(μ, σ) ),     (2)

where

A(μ, σ) = log Σ_{x ∈ {0,1}^n} exp( Σ_{i ∈ V} μ_i x_i + Σ_{(i,j) ∈ E} σ_ij x_i x_j )

is the log normalization constant. It is immediate that the Ising model is an exponential family with sufficient statistics x_i and x_i x_j. It is also minimal, since these sufficient statistics are linearly independent, i.e., no nonzero linear combination of them is constant almost everywhere.
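To make the computational burden of the normalising constant concrete, the following minimal Python sketch (ours, not part of the original paper) enumerates all configurations of a small Ising model with 0/1 nodes and computes the constant by brute force; the names mu and sigma for the threshold and interaction parameters are our own labels for the quantities in (2).

```python
# A minimal sketch, assuming the 0/1 parameterisation in (2):
# mu[i] are node thresholds, sigma[(i, j)] are edge interactions.
from itertools import product
import math

def ising_log_weight(x, mu, sigma):
    """Unnormalised log-probability of a 0/1 configuration x (a tuple)."""
    lw = sum(mu[i] * x[i] for i in range(len(x)))
    lw += sum(s * x[i] * x[j] for (i, j), s in sigma.items())
    return lw

def partition_function(mu, sigma):
    """Exact normalising constant by summing over all 2^n configurations."""
    n = len(mu)
    return sum(math.exp(ising_log_weight(x, mu, sigma))
               for x in product((0, 1), repeat=n))

# Example: a 3-node clique with equal thresholds and interactions.
mu = [0.1, 0.1, 0.1]
sigma = {(0, 1): 0.5, (0, 2): 0.5, (1, 2): 0.5}
print(partition_function(mu, sigma))  # the 2^n sum is why large cliques are hard
```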

3 Intervention and conditioning graph

The general idea of an intervention graph is the same as that of a causal directed graph (Lauritzen, 2001). An intervention is defined as a manipulation from outside the graph such that a variable (or set of variables) is fixed (clamped) to a particular value, where no other variable can affect this conditioning node (Spirtes, Meek and Richardson, 1996; Eberhardt and Scheines, 2007). This is equivalent to the do-operation (Pearl, 2000). No other nodes are affected directly by the intervention except those in the conditioning set. In an undirected graph the clique structure remains the same and the values of the intervened nodes are replaced by the intervention values (intervention by replacement). We then require that the factorisation of the graph remains intact, and that conditioning on the nodes in the cliques that intersect with the intervention nodes does not disrupt the factorisation (i.e., the graph remains Markov compatible). This leads to the following definition of intervening in undirected graphs (Lauritzen, 2001; Lauritzen and Richardson, 2002).

  • (Causal undirected graph) Let G be a graph with a Markov compatible distribution over its clique set. Furthermore, let replacement values be assigned to the nodes in the intervention subset, replacing their original values. Then we call G a causal undirected graph for this distribution if

    (3)

Note that when the intervention set is empty, there is no intervention. We can equivalently write

(4)

where the notation identifies the cliques that share nodes with the intervention set. We see from this definition that we need only determine the intervention locally, with respect to the clique. Suppose that we intervene on a node in some clique with a particular value. Then by the definition we only need the cliques that contain the intervened node, and the rest of the terms in the factorisation remain as before.

The definition does not yet say what the intervention means for each clique factor. Lauritzen and Richardson (2002) show that in undirected graphs with a finite state space for each node, intervening by replacement (the do-operation) is equivalent to conditioning, that is

(5)

The reason is that the structure of the undirected graph is not changed when intervening, at least not when using intervention by replacement. For directed (acyclic) graphs this is different, because any incoming edges (arrows) on the intervention nodes are deleted, since the intervention completely controls those nodes and no other variables can affect them (Spirtes, Glymour and Scheines, 1993; Lauritzen, 2001). This changes the structure of the graph and therefore the distribution, so the difference between intervention and conditioning can be detected. In undirected graphs, by contrast, nothing in the structure is changed, and we cannot distinguish between having observed or intervened on the values of the intervention set: there is no difference between intervention and conditioning to be detected in terms of conditional independencies.

Equivalently, we can think of an intervention on a node as an additional exogenous node directly connected only to that intervention node (Spirtes, Glymour and Scheines, 1993; Eberhardt and Scheines, 2007). This exogenous node switches the intervention on or off (the do-operation) for that node only. If the exogenous node is off, then the observational distribution with respect to the node obtains. If the exogenous node is on, then the structure of the graph remains unchanged, resulting in the same factorisation as without intervention but with the value of the node set to 0 or 1. For each node in the intervention set there is exactly one such exogenous node, connected directly and only to that node.

Consider the graph in Figure 1(a) with five nodes, and let the configuration of node values be a binary vector. There are three cliques, shown in Figure 1(a). The joint distribution is

(6)

where we used the factorisation in (1). According to our definition of intervention, an intervention on node 2 would result in the distribution

But by the fact that the intervention distribution equals the conditional distribution given the value of node 2, we obtain

the same expression with the normalising constant now computed over the remaining nodes. We thus observe that conditioning yields the same distribution as intervening in undirected graphs.

Figure 1: Graph of 5 nodes with its three cliques, represented in (a). In (b), the equivalent representation of intervening (conditioning) on node 2 in an undirected graph by an auxiliary variable that determines the value of node 2.

A representation of the equivalent version of intervening by an exogenous node is shown in Figure 1(b). It has the same setup as the previous example shown in Figure 1(a), but now the exogenous variable is added. It is clear that the same exogenous node cannot simultaneously intervene on another node, as this would in general lead to spurious connections between the endogenous nodes of the graph.

4 Conditioning in Ising models

In the Ising model any edge (i, j) is represented by the product σ_ij x_i x_j, and there are no higher-order terms. Consider again Figure 1 with five nodes. If we assume that the external field (the thresholds) is 0, then only the pairwise products of the cliques remain in the factorisation. So the joint distribution of the Ising model for Figure 1 can be written as

(7)

where Z is the normalising constant, and we immediately recognise the factorisation in (1). Conditioning on node 2 having a particular value then requires the normalising constant over the remaining variables. From the factorisation and the equality of intervening and conditioning in (3), we can consider each clique separately and plug in the value of the conditioning node. So, for the first clique containing node 2 we get

The normalising constant in the denominator is obtained by summing over the possible values 0 and 1 of the remaining nodes in the clique; we denote this clique-specific normalising constant accordingly. And for the second clique containing node 2 we obtain

where the normalising constant of the denominator is again determined by the possible values of the remaining nodes in that clique. For the clique that does not contain node 2 nothing needs to change, so its normalising constant remains as before. We then obtain the conditional distribution from the factorisation theorem by taking the product

(8)

where the overall normalising constant is the product of the clique-specific normalising constants.

Note that the complexity of the normalising constant depends on the size of the clique because the different cliques are conditionally independent; the larger the clique, the more complex it will be to determine. In fact the complexity is 2^m, with m the number of nodes in the clique, and so is exponential in the clique size (Wainwright and Jordan, 2008). For small graphs the complexity is not prohibitive, since the constant can be computed directly. However, the complexity is problematic for large graphs where the clique size is large. In such cases we require a computationally efficient way to obtain the normalising constant of the clique, excluding the nodes that are conditioned on.
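The following minimal sketch (our own illustration, not the authors' code) shows conditioning in the Ising model: a node is clamped to a fixed value (intervention by replacement) and the normalising constant of a clique is recomputed by summing only over the free nodes.

```python
# A minimal sketch of a clique-wise conditional normalising constant, assuming 0/1 nodes.
from itertools import product
import math

def clique_conditional_Z(nodes, mu, sigma, clamped):
    """Normalising constant of one clique with some nodes clamped to fixed 0/1 values.

    nodes   : list of node labels in the clique
    mu      : dict node -> threshold
    sigma   : dict (i, j) -> interaction for an edge inside the clique
    clamped : dict node -> fixed 0/1 value (the conditioning / intervention set)
    """
    free = [v for v in nodes if v not in clamped]
    total = 0.0
    for values in product((0, 1), repeat=len(free)):
        x = dict(clamped)
        x.update(zip(free, values))
        lw = sum(mu[v] * x[v] for v in nodes)
        lw += sum(s * x[i] * x[j] for (i, j), s in sigma.items())
        total += math.exp(lw)
    return total

# Clique {1, 2, 3} with node 2 clamped to 1, zero thresholds, equal interactions:
nodes = [1, 2, 3]
mu = {1: 0.0, 2: 0.0, 3: 0.0}
sigma = {(1, 2): 0.4, (1, 3): 0.4, (2, 3): 0.4}
print(clique_conditional_Z(nodes, mu, sigma, clamped={2: 1}))  # cost is 2^(number of free nodes)
```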

5 Determining the normalising constant

From the last example in the previous section it is clear that the normalising constant for the conditional distribution can be cumbersome but, fortunately, it can be factorised. Computing the normalising constant of the Ising model is in general an NP-hard problem (see, e.g., Wainwright and Jordan, 2008). For instance, with 30 nodes we already require a sum over more than a billion terms. In optimisation algorithms the normalising constant typically has to be computed thousands of times, which makes such calculations infeasible.

A naive approach would be to simply ignore the interactions between the variables and imagine that we have an empty graph, such that all variables are independent. When assuming an empty graph, we obtain the well-known result that the joint probability equals the product of the marginal probabilities. In the graph without interactions, the empty graph, the joint probability for any subset of nodes is, because of independence,

(9)

where the normalising constant factorises into a product of single-node terms. Hence, a quite simple approximation, called the inner approximation, is obtained by

(10)

In the inner approximation the interaction parameters are completely ignored in the normalising constant. Hence, the estimate of the normalising constant for a subset of nodes will be lower than in reality, and is therefore called a lower bound (Wainwright and Jordan, 2008).
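As a rough illustration of the inner approximation in (10) (our own sketch, assuming 0/1 nodes so that each node contributes a factor 1 + exp(mu_i), and non-negative interactions so that dropping them indeed lowers the constant):

```python
# A minimal sketch of the inner (empty-graph) lower bound on Z, assuming 0/1 nodes.
import math

def inner_approximation(mu):
    """Lower bound on Z from the empty graph: each node contributes 1 + exp(mu_i)."""
    Z_lower = 1.0
    for m in mu:
        Z_lower *= 1.0 + math.exp(m)   # node off (weight 1) or on (weight exp(mu_i))
    return Z_lower

print(inner_approximation([0.1, 0.1, 0.1]))  # compare with the brute-force Z above
```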

Another approach is based on the idea that the graph has no cycles and hence is a tree. This idea also underlies the so-called Bethe approximation (Wainwright and Jordan, 2008). A tree has no cycles, and so there are only pairwise connections between the nodes, i.e., the maximal clique size is two. This implies that the normalising constant can be written as a product of factors involving no more than two nodes each, since those are the cliques. And so, for the Ising model on a tree we obtain the normalising constant as a product of pairwise terms.

It is clear that this is computationally much easier, with a cost that scales with the number of edges of the graph, than when larger cliques are involved. But in general, for graphs with cycles, we obtain an approximation to the true normalising constant whose accuracy depends on how close the true graph is to a tree (Wainwright and Jordan, 2008).
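To illustrate why trees are computationally easy (this is our own sketch of exact computation on a chain, the simplest tree, and not the Bethe expression itself, which is elided above):

```python
# A minimal sketch: exact Z for a chain of 0/1 Ising nodes by dynamic programming,
# passing a 2-vector of partial sums along the chain instead of enumerating 2^n states.
import math

def chain_partition_function(mu, sigma):
    """Exact Z for a chain: nodes 0..n-1, edge (i-1, i) has interaction sigma[i-1]."""
    n = len(mu)
    # msg[v] = sum of unnormalised weights of the prefix, given the current node has value v
    msg = [1.0, math.exp(mu[0])]
    for i in range(1, n):
        new = [0.0, 0.0]
        for v in (0, 1):                      # value of node i
            for u in (0, 1):                  # value of node i - 1
                new[v] += msg[u] * math.exp(mu[i] * v + sigma[i - 1] * u * v)
        msg = new
    return msg[0] + msg[1]

print(chain_partition_function([0.1, 0.1, 0.1], [0.5, 0.5]))  # O(n) rather than O(2^n)
```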

5.1 The Curie-Weiss graph

By the factorisation property in (1) we need to normalise over at most the number of variables in a clique. Because the factorisation is defined over cliques, in which all nodes are connected to each other, we can invoke the normalising constant of a complete graph for each clique, where all nodes are each other's neighbours. When within a clique the threshold parameters are all equal and the interaction parameters are all equal, this subgraph can be modelled by a Curie-Weiss graph (Baxter, 2007), for which the normalising constant is easy to determine (Marsman, 2018).

We simplify the model here by letting all interactions between nodes have the same parameter σ and all thresholds the same parameter μ. Let n̄ be the average number of neighbours in the graph. Then the effect of the neighbours on any one of the nodes is σ times the sum of the values of its neighbours.

In the mean field model we consider the effect on a node as if all nodes were connected to each other and we use the average effect of all other nodes on any one of them. So we obtain (Baxter, 2007)

where the distribution depends on the configuration only through the sum of the node values. Each value of this sum can be obtained in a number of ways given by the corresponding binomial coefficient. This leads to the probability of a configuration with a given sum

with normalising constant

(11)

This version of a complete network as an approximation to one with on average n̄ neighbours is sometimes referred to as the Curie-Weiss model. The complexity of computing its normalising constant is linear in the number of nodes, and so much smaller than for a graph with an arbitrary edge distribution, which is exponential in general. Using the Curie-Weiss version thus makes the problem of determining the normalising constant in the cliques linear and hence scalable.
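The following sketch shows one natural reading of (11) for a fully connected clique with a common threshold and a common interaction parameter (our own illustration; the exact form of (11), including any scaling by the average number of neighbours, is not reproduced here):

```python
# A minimal sketch of an O(n) Curie-Weiss normalising constant, assuming 0/1 nodes,
# a common threshold mu, and a common interaction sigma inside a fully connected clique.
import math

def curie_weiss_Z(n, mu, sigma):
    """All configurations with m active nodes share the same weight, so Z has n + 1 terms."""
    Z = 0.0
    for m in range(n + 1):
        # C(n, m) configurations have m active nodes, each with m*(m-1)/2 active edges
        Z += math.comb(n, m) * math.exp(mu * m + sigma * m * (m - 1) / 2)
    return Z

print(curie_weiss_Z(3, 0.1, 0.5))  # equals the brute-force Z above when parameters are equal
```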

We cannot blindly apply the Curie-Weiss computation to the conditional distribution in the Ising model because we have to deal with the values of the variables in the conditioning set.

We continue with the example of the Ising model corresponding to Figure 1(a) and determine the Curie-Weiss version of the normalising constant. There is no external field, so all thresholds are 0. For each clique for which we need the normalising constant, we take the average of the interaction parameters. We obtain two versions of the normalising constant, depending on whether the value of the conditioning node 2 is 0 or 1. Considering the normalising constant for a clique containing node 2, we see that the thresholds of the remaining nodes have changed: when node 2 equals 1, the interaction with node 2 is added to the threshold of each remaining node. In this example the thresholds were 0, and so a threshold appears by conditioning. Therefore, whenever node 2 is 0 the thresholds remain as if node 2 were not there, and if node 2 is 1 the thresholds of the remaining nodes in the clique increase by the corresponding interaction parameters. In the Curie-Weiss version of the normalising constant we therefore use the average interaction parameter and the average of these adjusted threshold parameters.

With these parameters we fill in equation (11) to obtain the normalising constant for the probability of the clique containing node 2, conditional on the value of node 2.

From this example we can determine the general rule for obtaining the normalising constant of any clique in the factorisation of the conditional distribution. For any clique from the Markov distribution that intersects the conditioning set, the interaction parameter with each conditioning node that equals 1 is added to the threshold parameters of the remaining nodes in the clique; otherwise the threshold parameters in the clique remain the same. Hence, for a clique and a conditioning set we obtain the averaged parameters

(12)

This simple rule, where we adjust the threshold parameters and leave the interaction parameters as they are, allows us to apply the Curie-Weiss normalisation constant to each clique in the factorisation of the distribution. In the case that all threshold parameters are equal and all interaction parameters are equal within a clique, this result is exact.
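A rough sketch of this rule (our own reading of (12), not the authors' code): interactions with conditioned nodes that equal 1 are absorbed into the thresholds of the free nodes, and the free part of the clique is then treated as a Curie-Weiss graph with averaged parameters.

```python
# A minimal sketch, assuming 0/1 nodes; the averaging follows our reading of rule (12).
import math

def curie_weiss_Z(n, mu, sigma):
    """O(n) Curie-Weiss normalising constant for n 0/1 nodes with equal parameters."""
    return sum(math.comb(n, m) * math.exp(mu * m + sigma * m * (m - 1) / 2)
               for m in range(n + 1))

def curie_weiss_conditional_Z(nodes, mu, sigma, clamped):
    """Approximate conditional clique normalising constant via the threshold-absorption rule."""
    free = [v for v in nodes if v not in clamped]
    # adjusted threshold of a free node: its own threshold plus the interactions
    # with clamped neighbours whose value is 1
    mu_adj = {}
    for i in free:
        extra = sum(s for (a, b), s in sigma.items()
                    if (a == i and clamped.get(b) == 1) or (b == i and clamped.get(a) == 1))
        mu_adj[i] = mu[i] + extra
    mu_bar = sum(mu_adj.values()) / len(free)
    free_edges = [s for (a, b), s in sigma.items() if a in free and b in free]
    sigma_bar = sum(free_edges) / len(free_edges) if free_edges else 0.0
    return curie_weiss_Z(len(free), mu_bar, sigma_bar)

nodes = [1, 2, 3]
mu = {1: 0.0, 2: 0.0, 3: 0.0}
sigma = {(1, 2): 0.4, (1, 3): 0.4, (2, 3): 0.4}
# With equal parameters this matches the exact conditional constant computed earlier.
print(curie_weiss_conditional_Z(nodes, mu, sigma, clamped={2: 1}))
```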

  • (Exact normalisation) Let G be a graph induced by the Ising model with cliques in the set of all cliques, where for each clique the threshold parameters are equal and the interaction parameters are equal within the clique. Then for each clique the Curie-Weiss normalisation constant is identical to the exact normalisation constant, and hence the normalising constant of the graph G is identical to the exact normalisation constant.

We consider the example from Figure 1(b), where we look at a clique containing node 2 (see Figure 1(a)) and we condition on node 2. If for this clique we take all interaction parameters equal and, as before, zero thresholds, then we obtain the equivalence according to Proposition 5.1

where the Curie-Weiss normalisation constant is obtained with (12) and (11). With equal parameters the approximate and the exact normalising constants coincide. If we instead change the edges to unequal values, then the averaged parameters no longer match the individual ones, and the Curie-Weiss value differs from the exact value. We denote the approximation of the normalising constant for a clique using the averages from (12) as the Curie-Weiss normalising constant of that clique.

We see from this small example that, in general, when the threshold and interaction parameters are different, using the Curie-Weiss graph is an approximation. We should then ask under what circumstances the error between the exact and approximate versions is bounded, so that it may still be reasonable to use the approximation.

5.2 Bounding the error of the Curie-Weiss approximation

By assuming that the deviation of the parameters is small (concentration is high), we can guarantee that the error in the ratio of the exact and approximate normalisation constants (using the Curie-Weiss graph) is bounded. We will assume that the parameters are concentrated around the Curie-Weiss values in the sense of sub-Gaussian variables whose parameter scales with the inverse of the clique size. This is a strong assumption. For instance, when the distribution is normal, it means that the standard deviation of the parameters around their means is divided by the size of the clique.

A sub-Gaussian random variable is one whose centred moment generating function is bounded by that of a normal variable: there exists a τ > 0 such that E exp(t(X − E X)) ≤ exp(t²τ²/2) for any real t. Taking the threshold and interaction parameters from sub-Gaussian distributions with their respective parameters, together with the Hoeffding bound (see the Appendix and, e.g., Boucheron, Lugosi and Massart, 2013, or Venkatesh, 2013), gives the approximations

where the approximation sign means that the difference can be made arbitrarily small with probability arbitrarily close to one. Plugging these approximations into the maximal term of the approximate version gives a bound on the ratio of the exact to the approximate normalising constant, which converges to 1.

  • (Error bound on the normalising constant) Let G be a graph associated with a set of random variables generated by the Ising probability (2). Furthermore, assume for each clique in G that the threshold parameters and the interaction parameters are independent sub-Gaussian variables around their respective means. Then for a clique of size m, with high probability, the ratio of the exact to the approximate normalising constant is

    (13)

    as the clique size increases.

Equivalently, we could say that the difference between the normalising constants is

It follows from (13) that when the assumption on the sub-Gaussian parameter does not hold, the error of the Curie-Weiss approximation with unequal interaction parameters can become undesirably large. The constraint for the error of the Curie-Weiss approximation to disappear implies that the deviations of the parameters in the clique cannot be too far from their means. This is mostly problematic for large cliques, but for bounded clique sizes the bound in Proposition 5.2 indicates that the differences will not be too severe.
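As a small empirical check of the concentration step (our own sketch; the exact constants in the elided display (14) are not recoverable here, so we use the standard Hoeffding-type bound 2·exp(−k t²/(2τ²)) for the average of k sub-Gaussian draws with parameter τ):

```python
# A minimal sketch: the empirical exceedance frequency of the sample mean stays below
# the standard Hoeffding-type bound for sub-Gaussian draws.  Normal draws with
# standard deviation tau are sub-Gaussian with parameter tau.
import math
import random

random.seed(0)
k, tau, t, reps = 50, 1.0, 0.3, 20000
exceed = 0
for _ in range(reps):
    avg = sum(random.gauss(0.0, tau) for _ in range(k)) / k
    exceed += abs(avg) > t
empirical = exceed / reps
bound = 2 * math.exp(-k * t ** 2 / (2 * tau ** 2))
print(empirical, bound)   # the empirical frequency should not exceed the bound
```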

6 Numerical illustration

To obtain a clear picture of realistic situations in which we require the (conditional) clique normalising constant, we determine the error for cliques of different sizes and for different values of the sub-Gaussian parameter. We vary the clique size from 10 to 100, where the error should be small at clique size 100. We vary the sub-Gaussian parameter from 1 to 10, where the error is highest at 10. The approximations are computed 100 times.
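The following is a rough sketch of this kind of simulation (our own, not the authors' code); the clique size, means, and spread are illustrative choices, with normal draws serving as the sub-Gaussian distribution.

```python
# A minimal sketch: compare the exact clique normalising constant with the Curie-Weiss
# approximation that uses averaged parameters, for parameters drawn around common means.
from itertools import product, combinations
import math
import random

def exact_Z(n, mu, sigma):
    """Brute-force Z of a fully connected 0/1 clique with thresholds mu[i] and edges sigma[(i, j)]."""
    Z = 0.0
    for x in product((0, 1), repeat=n):
        lw = sum(mu[i] * x[i] for i in range(n))
        lw += sum(sigma[(i, j)] * x[i] * x[j] for i, j in combinations(range(n), 2))
        Z += math.exp(lw)
    return Z

def curie_weiss_Z(n, mu_bar, sigma_bar):
    return sum(math.comb(n, m) * math.exp(mu_bar * m + sigma_bar * m * (m - 1) / 2)
               for m in range(n + 1))

random.seed(1)
n, tau = 8, 0.1 / 8             # clique size kept small so the exact Z stays feasible
mu = [random.gauss(0.2, tau) for _ in range(n)]
sigma = {(i, j): random.gauss(0.3, tau) for i, j in combinations(range(n), 2)}
ratio = exact_Z(n, mu, sigma) / curie_weiss_Z(n, sum(mu) / n, sum(sigma.values()) / len(sigma))
print(ratio)                    # close to 1 when the spread tau around the means is small
```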

Figure 2: Error of the normalisation constant compared to the exact value where all interaction parameters are equal. In (a), the error as a function of clique size; in (b), the error for a network of 50 nodes as a function of the sub-Gaussian parameter.

Figure 2(a) shows the error between the Curie-Weiss approximation and the exact value for different clique sizes. As expected, for small clique sizes the error is non-negligible and will have some effect on the probabilities. Note that the probability can either increase or decrease depending on over- or under-estimation of the normalising constant. A similar picture is obtained from Figure 2(b), where increasing the sub-Gaussian parameter causes larger approximation errors for the Curie-Weiss normalising constant. Note that single computations of the Curie-Weiss normalising constant can differ noticeably for small clique sizes, but the average of several computations (here 100) appears quite accurate. Since computation of the Curie-Weiss normalising constant is fast for each clique, one might consider averaging several estimates to obtain more accurate approximations.

7 Discussion

We considered the issue of intervening in Ising graphs with binary 0-1 nodes, where an intervention was defined by replacement (the do-operator). This led to the fact that interventions can be seen as conditioning on nodes in Ising graphs. To obtain the probabilities in intervention graphs, we showed that for graphs with equal connectivities within each clique there is an exact solution for the normalisation constant using the Curie-Weiss model. This simplifies computations considerably, going from a cost exponential in the number of nodes in the graph to a cost of the order of the number of cliques times the size of the largest clique. We also showed that if the connectivities of the edges in the Ising graph are unequal, but the variation is sub-Gaussian, then the error is exponentially small. We confirmed these results with simulations. The simulations indicated that the effect of violating the requirement of sub-Gaussian variation in connectivities can be diminished if the computations are repeated several times.

Appendix

Proof of Proposition 5.1. By assumption, for each clique the threshold parameters are all equal and the interaction parameters are all equal. Hence, (12) returns exactly these common values. Since the clique is then equal to a Curie-Weiss graph, we obtain the exact normalising constant for the clique. By the factorisation (1) the clique terms are conditionally independent, and hence the product of the clique normalising constants equals the normalising constant of the graph. ∎

Proof of Proposition 5.2. We compare the normalising constant of the exact Curie-Weiss version, in which all thresholds are equal and all interactions are equal, with the approximate version, in which the thresholds and interactions may all differ from the Curie-Weiss parameters. For the exact Curie-Weiss version of a clique we have the normalising constant

The approximate version is the same except that instead of the common threshold and interaction we use the averaged values of the clique defined in (12); we denote this approximate normalising constant for the clique accordingly.

We consider the maximal value of the sum in the normalising constant in the exact case, that is,

Then we can use the Hoeffding bound to obtain an approximation to these Curie-Weiss parameters when the thresholds and interactions are obtained independently from sub-Gaussian distributions with their respective means. For independent sub-Gaussian random variables with the same mean and the same parameter, we obtain the Hoeffding bound, for any positive deviation,

(14)

(see, e.g., Boucheron, Lugosi and Massart, 2013; Venkatesh, 2013). We use the Hoeffding bound and the assumption of sub-Gaussian variables for the threshold and the interaction parameters, taking the right-hand side of the Hoeffding bound as the exceedance probability for the interaction parameters. Applying the bound to the averaged threshold and interaction parameters, we then have, with high probability,

And so we obtain the approximations

Plugging these approximations into the maximal term of the approximate version gives

So, the maximal error that we incur for each term converges to 1. Taking this common term out of the sum shows that we obtain the result. ∎

References

  • Baxter, R. J. (2007). Exactly Solved Models in Statistical Mechanics. Courier Corporation.
  • Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. Journal of the Royal Statistical Society, Series B (Methodological) 36, 192-236.
  • Boucheron, S., Lugosi, G. and Massart, P. (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press.
  • Cipra, B. A. (1987). An introduction to the Ising model. The American Mathematical Monthly 94, 937-959.
  • Cowell, R. G., Dawid, A. P., Lauritzen, S. L. and Spiegelhalter, D. J. (1999). Probabilistic Networks and Expert Systems. Springer.
  • Dembo, A., Montanari, A. and Sun, N. (2013). Factor models on locally tree-like graphs. The Annals of Probability 41, 4162-4213.
  • Eberhardt, F. and Scheines, R. (2007). Interventions and causal inference. Philosophy of Science 74, 981-995.
  • Kindermann, R. and Snell, J. L. (1980). Markov Random Fields and Their Applications. American Mathematical Society, Providence, RI.
  • Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer, New York, NY.
  • Lauritzen, S. L. (1996). Graphical Models. Oxford University Press.
  • Lauritzen, S. L. (2001). Causal inference from graphical models. In Complex Stochastic Systems (O. E. Barndorff-Nielsen, D. R. Cox and C. Klüppelberg, eds.) 63-107. Chapman and Hall/CRC Press, London/Boca Raton.
  • Lauritzen, S. L. and Richardson, T. S. (2002). Chain graph models and their causal interpretations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64, 321-348.
  • Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge University Press.
  • Spirtes, P., Glymour, C. and Scheines, R. (1993). Causation, Prediction, and Search. Springer-Verlag.
  • Spirtes, P., Meek, C. and Richardson, T. (1996). Causal Inference in the Presence of Latent Variables and Selection Bias. Technical Report CMU-77-Phil, Carnegie Mellon University.
  • Venkatesh, S. (2013). The Theory of Probability. Cambridge University Press.
  • Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1, 1-305.
