
# Intervention in undirected Ising graphs and the partition function

Undirected graphical models have many applications in areas such as machine learning, image processing, and, recently, psychology. Psychopathology in particular has received a lot of attention, where symptoms of disorders are assumed to influence each other. One of the most practically relevant questions is which symptom (node) to intervene on to have the most impact. Intervention in undirected graphical models is equivalent to conditioning, and so the machinery of the Ising model is available to determine the best intervention strategy. Such calculations require the partition function, which is computationally hard to obtain. Here we use a Curie-Weiss approach to approximate the partition function in applications of interventions. We show that when the connection weights in the graph are equal within each clique we obtain exactly the correct partition function. And if the weights vary according to a sub-Gaussian distribution, then the approximation is exponentially close to the correct one. We confirm these results with simulations.

11/07/2017


## 1 Introduction

Graphical models are popular in many applications such as machine learning, image processing, social science, and, recently, psychology. One of the earlier applications was in expert systems, where the objective was to determine the probability of a correct diagnosis given a specific configuration of symptoms and screenings (Cowell et al., 1999). Such applications are also extremely relevant to psychology. As in expert systems, the effect of interventions (medication or therapy) is of paramount interest. Lauritzen and Richardson (2002) showed that intervention by replacement (hard intervention) in undirected graphs is equivalent to conditioning, unlike intervention in directed acyclic graphs. As a consequence, no special treatment is required to determine (marginal) probabilities for interventions in undirected graphs.

Specifically, for the Ising model, where binary nodes are modeled by their values and their interactions with neighbouring (i.e., connected) nodes, determining the probability under an intervention means that we simply fix a variable to a specific value (0 or 1) and then determine the probabilities as we would in the conditional distribution. We do, however, require the partition function (normalising constant), which boils down to marginalising with the specific values plugged in for the conditioning variables. For the Ising model this marginalisation is computationally intensive, since the number of terms grows exponentially in the number of free nodes. Approximations to the partition function can be used to obtain approximate probabilities. One approach is to ignore the interactions altogether, which simplifies the partition function to a product of the partition functions of each node separately. Another approach is to obtain upper and lower bounds on the partition function, like the Bethe lattice (Wainwright and Jordan, 2008) or the related version for locally tree-like graphs (Dembo et al., 2013). Here we use a different approach and consider the fact that each clique is in itself a Curie-Weiss model (a fully connected graph), for which the partition function can be determined in time linear in the size of the largest clique. In a Curie-Weiss model the edge weights are considered equal, which is obviously inappropriate in many situations. We therefore determine the error of approximation when the variation in edge weights is limited to sub-Gaussian random variables; this error shrinks with k, the size of the clique.

We first discuss undirected graphical models in Section 2. Next we discuss how interventions can be defined on undirected graphical models in Section 3, and how such conditioning is implemented in Ising models in Section 4. Here the problem is the normalising constant (partition function), which makes direct calculation of probabilities intractable. In Section 5 we discuss possible solutions and present our approach based on the Curie-Weiss model. In Section 6 we perform several simulations to illustrate the size of the errors in the normalising constant with the Curie-Weiss model. Proofs can be found in the Appendix.

## 2 Undirected graphical models

An undirected graphical model or Markov random field is a set of probability distributions representing the structure of some graph G. There are two equivalent ways of defining a Markov random field: (i) in terms of Markov properties and (ii) in terms of the factorization property.

Let G = (V, E) be an undirected graph, where V is the set of nodes and E is the set of edges (s, t), with |V| = p. A subset of nodes B is a cutset or separator set of the graph if removing B results in two (or more) components. That is, B is a cutset if any path between two nodes s and t in different components must go through some node in B. A clique C is a subset of nodes in V such that all nodes in C are connected, that is, for any s, t ∈ C it holds that (s, t) ∈ E. A maximal clique is a clique C such that including any other node in C will no longer give a clique.

For an undirected graph G, we associate with each vertex s ∈ V a random variable Xs. For any subset of nodes A ⊆ V we define a configuration xA = (xs, s ∈ A). A configuration for V is x = (x1, ..., xp). An edge set restricted to the edges among a subset A is denoted by E(A).

Two variables Xs and Xt are independent if p(xs, xt) = p(xs)p(xt), and we write this as Xs ⟂ Xt. The variables Xs and Xt are conditionally independent given Xu if p(xs, xt | xu) = p(xs | xu)p(xt | xu). For subsets of nodes A, B, and S, we denote by XA ⟂ XB | XS that XA is conditionally independent of XB given XS. A random vector X = (X1, ..., Xp) is Markov compatible, or Markov with respect to G, if XA ⟂ XB | XS whenever S is a cutset that yields two disjoint subsets A and B. For strictly positive distributions the Hammersley-Clifford theorem says that the Markov property is equivalent to the factorisation property (Cowell et al., 1999; Lauritzen, 1996). The distribution of the random vector X is said to factorise according to graph G if it can be represented by a product of compatibility functions (not necessarily probabilities in general) of the cliques

 p(x) = (1/ZV) ∏C∈C ψC(xC) (1)

where ψC are compatibility functions for clique C and ZV is the normalising constant. This factorisation is convenient since it implies that the effects of conditioning can be evaluated for each clique separately.

One of the most well known binary undirected graphical models is the Ising model, known from statistical physics as a model of magnetisation (see, e.g., Kindermann et al., 1980; Cipra, 1987; Kolaczyk, 2009). The Ising model considers cliques of sizes one and two nodes only, so the interactions are at most pairwise (Wainwright and Jordan, 2008; Besag, 1974). Let θ be the parameter vector containing all threshold parameters θs and interaction parameters θst. The distribution of the Ising model can be written as

 pθ(x) = exp(∑s∈V θsxs + ∑(s,t)∈E θstxsxt − A(θ)) (2)

where

 A(θ) := log ∑x∈{0,1}^p exp(∑s∈V θsxs + ∑(s,t)∈E θstxsxt)

is the log normalization constant. It is immediate that the Ising model is an exponential family with sufficient statistics (xs, s ∈ V) and (xsxt, (s, t) ∈ E). It is also minimal since these functions are linearly independent, i.e. no nonzero linear combination of them is constant almost everywhere.
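As an illustrative check of (2), the log normalisation constant A(θ) can be computed by brute force for small p. The function name and example weights below are ours, not the paper's; the sum has 2^p terms, so this is only feasible for small graphs.

```python
import itertools
import math

# Brute-force log normalisation constant A(theta) of an Ising model on
# {0,1}^p: sums over all 2^p binary configurations.
def log_partition(theta_node, theta_edge):
    """theta_node: {s: theta_s}; theta_edge: {(s, t): theta_st}."""
    nodes = sorted(theta_node)
    total = 0.0
    for config in itertools.product([0, 1], repeat=len(nodes)):
        x = dict(zip(nodes, config))
        energy = sum(theta_node[s] * x[s] for s in nodes)
        energy += sum(th * x[s] * x[t] for (s, t), th in theta_edge.items())
        total += math.exp(energy)
    return math.log(total)

# Without interactions, A(theta) reduces to a sum of per-node terms.
A_indep = log_partition({1: 0.3, 2: -0.2}, {})
```

With all θst = 0 this reduces to ∑s log(1 + exp(θs)), the independent case that reappears as the inner approximation in Section 5.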

## 3 Intervention and conditioning graph

The general idea of an intervention graph is the same as in the causal directed graph (Lauritzen, 2001). An intervention is defined as a manipulation from outside the graph such that a variable (or set of variables) is fixed (clamped) to a particular value, where no other variable can affect this conditioning node (Spirtes, Meek and Richardson, 1996; Eberhardt and Scheines, 2007). This is equivalent to the do-operation (Pearl, 2000). No other nodes are affected directly by the intervention except those in the conditioning set. In an undirected graph the clique structure remains the same and the values xA are replaced by x⋆A (intervention by replacement). We then want the factorisation of the graph to remain, so that conditioning on the nodes in the cliques that intersect with the intervention nodes A does not disrupt the factorisation (i.e., the graph remains Markov compatible). This leads to the following definition of intervening in undirected graphs (Lauritzen, 2001; Lauritzen and Richardson, 2002).

• (Causal undirected graph) Let G be a graph with Markov compatible distribution p over the clique set C in G. Furthermore, let x⋆A be the values of the nodes in the subset A ⊆ V that replace the original values. Then we call G a causal undirected graph for p if

 p(x∣∣x⋆A)=∏C∈CpC(xC∖A∣∣x⋆C∩A) (3)

Note that when A = ∅, there is no intervention. We can equivalently write

 p(x∣∣x⋆A)=∏C∋C∩A≠∅pC(xC∖A∣∣x⋆C∩A)∏C∋C∩A=∅pC(xC) (4)

where C ∋ C ∩ A ≠ ∅ identifies the cliques C with C ∩ A ≠ ∅. We see from this definition that we need only determine the intervention locally, with respect to the clique. Suppose that we intervene on node i in clique C with the value x⋆i. Then we have from our definition that we only need to recompute the cliques C such that i ∈ C, and the rest of the terms in the factorisation remain as before.

The definition is still unclear on what pC(xC∖A | x⋆C∩A) means for each clique factor. Lauritzen and Richardson (2002) show that in undirected graphs with finite state space for each node, intervening by replacement (do-operation) is equivalent to conditioning, that is

 pC(xC∣∣x⋆C∩A)=pC(xC∖A∣x⋆C∩A) (5)

The reason is that the structure of the undirected graph does not change when intervening, at least not when using intervention by replacement. For directed (acyclic) graphs this is different, because any incoming edges (arrows) on the intervention nodes will be deleted since the intervention completely controls these nodes, and no other variables can affect them (Spirtes, Glymour and Scheines, 1993; Lauritzen, 2001). This changes the structure of the graph and therefore the distribution, and so the difference between intervention and conditioning can be detected. In undirected graphs nothing of the structure changes, and so there is no difference between intervention and conditioning to be detected in terms of conditional independencies; we cannot distinguish between having observed or intervened on the values in A.

Equivalently, we can think of an intervention on node i as an additional node Ii directly connected to the intervention node i in the intervention set A (Spirtes, Glymour and Scheines, 1993; Eberhardt and Scheines, 2007). This node sets node i to on or off (do-operation), and only node i. If Ii sets the node to off, then the observational distribution with respect to node i obtains. If Ii sets node i to on, then the structure in G remains unchanged, resulting in the same factorisation as without intervention but with the value of node i set to 0 or 1. For each node i in the intervention set A there is a node Ii that is connected only directly to node i; collectively such exogenous nodes are referred to as I, where each node in I is connected to a single node in A.

Consider the graph in Figure 3(a) with five nodes and let x be a binary vector. There are three cliques, {1, 5}, {1, 2}, and {2, 3, 4}. The joint distribution is

 p(x)=p(x1,x5)p(x1,x2)p(x2,x3,x4) (6)

where we used the factorisation in (1). According to our definition of intervention, an intervention on node 2 with value x⋆2 would result in the distribution

 p(x∣∣x⋆2)=p(x1,x5)p(x1∣∣x⋆2)p(x3,x4∣∣x⋆2)

But by the fact that the intervention distribution equals the conditional distribution of the variables without node 2, we obtain

 p(x∣∣x⋆2)=p(x1,x5)p(x1∣x⋆2)p(x3,x4∣x⋆2)=p(x∖2∣x⋆2)

where x∖2 = (x1, x3, x4, x5). And we observe that conditioning yields the same distribution as intervening in undirected graphs.

## 4 Conditioning in Ising models

In the Ising model any edge (s, t) is represented by the product θstxsxt, and there are no higher order terms. Consider again Figure 3 with five nodes. If we assume that the external field is 0 (i.e., θs = 0 for all s), then we only have the products of the cliques from the factorisation. So the joint distribution of the Ising model for Figure 3 can be written as

 pθ(x) = (1/Z(θ)) exp(θ15x1x5) exp(θ12x1x2) exp(θ23x2x3+θ24x2x4+θ34x3x4) (7)

where Z(θ) is the normalising constant, and we immediately see we have the factorisation in (1). Conditioning on node 2 having value x⋆2 then requires the normalising constant over the remaining variables x∖2. From the factorisation and the equality of intervening and conditioning in (3), we can consider each clique separately and then plug in the value x⋆2 for conditioning. So, for the clique {1, 2} we get

 p(x1∣x⋆2) = exp(θ12x1x⋆2) / (1+exp(θ12x⋆2))

The normalising constant in the denominator is obtained by plugging in the possible values 0 and 1 for x1 respectively, obtaining 1 + exp(θ12x⋆2); we denote this normalising constant by Z1,∖2. And for the clique {2, 3, 4} we obtain

 p(x3,x4∣x⋆2) = exp(θ23x⋆2x3+θ24x⋆2x4+θ34x3x4) / (1+exp(θ23x⋆2)+exp(θ24x⋆2)+exp(θ23x⋆2+θ24x⋆2+θ34))

where the normalising constant in the denominator is determined by plugging in the values (0, 0), (1, 0), (0, 1), and (1, 1) for (x3, x4); we denote it by Z34,∖2. For the clique {1, 5} we need not change anything, and so its normalising constant Z15,∖2 remains as before. We then obtain the conditional distribution from the factorisation theorem by taking the product

 pθ(x∖2∣x⋆2) = (1/Z∖2) exp(θ15x1x5) exp(θ12x1x⋆2) exp(θ23x⋆2x3+θ24x⋆2x4+θ34x3x4) (8)

where

 Z∖2 = Z15,∖2 Z1,∖2 Z34,∖2

Note that the complexity of the normalising constant depends on the size of the clique, because the different cliques are conditionally independent; the larger the clique, the more complex it is to determine. In fact the complexity is O(2^k), with k the number of nodes in the clique, and so is exponential in the clique size (Wainwright and Jordan, 2008). For small graphs the complexity is not prohibitive, since the constant can be computed directly. However, it is problematic for large graphs where the clique size is large. In such cases we require a computationally efficient way to obtain the normalising constant of each clique, save the nodes that are conditioned on.

## 5 Determining the normalising constant

From the example in the previous section it is clear that the normalising constant for the conditional distribution can be cumbersome but, fortunately, it can be factorised. Computing the normalising constant of the Ising model is in general an NP-hard problem (see, e.g., Wainwright and Jordan, 2008). For instance, with 30 nodes we already require a sum over more than a billion terms. In optimisation algorithms the normalising constant typically has to be computed thousands of times, which makes such calculations infeasible.

A naive approach is to simply ignore the interactions between the variables and imagine that we have an empty graph, such that all variables are independent. When assuming an empty graph, we obtain the well-known result that the joint probability equals the product of the marginal probabilities. In the graph without interactions, the empty graph, the joint probability for any subset C is, by independence,

 pC(xC) = (1/ZC) exp(∑i∈C θixi) = ∏i∈C (1/Zi) exp(θixi) = ∏i∈C pi(xi) (9)

where the normalising constant for node i is Zi = 1 + exp(θi). Hence, a quite simple approximation, called the inner approximation, is obtained by

 ZC=∏i∈CZi=∏i∈C(1+exp(θi)) (10)

In the inner approximation the interaction parameters are completely ignored in the normalising constant. Hence, for positive interaction parameters the estimate of the normalising constant ZC for subset C will be lower than in reality, and is therefore called a lower bound (Wainwright and Jordan, 2008).

Another approach is based on the idea that the graph has no cycles and hence is a tree. This idea also underlies the so-called Bethe approximation (Wainwright and Jordan, 2008). A tree has no cycles, and so there are only pairwise connections between the nodes, i.e., the maximal clique size is two. This implies that the normalising constant can be written as a product of factors involving no more than two nodes each, since those are the cliques. And so, for the Ising model we obtain the normalising constant Z = ∏(i,j)∈E Zij, where

 Zij=1+exp(θi)+exp(θj)+exp(θi+θj+θij)

It is clear that this is computationally much easier, with O(m) terms where m is the number of edges of G, than when large cliques are involved. But in general, for graphs with cycles we obtain an approximation to the true normalising constant whose accuracy depends on how close the true graph is to a tree (Wainwright and Jordan, 2008).

### 5.1 The Curie-Weiss graph

By the factorisation property in (1) we require normalising over at most the number of variables in the largest clique. Because the factorisation property is defined over cliques, in which all nodes are connected to each other, we can invoke the normalising constant of a complete graph for each part, where all nodes are each other's neighbour. When within a clique the threshold parameters are all equal and the interaction parameters are all equal, this subgraph can be modeled by a Curie-Weiss graph (Baxter, 2007), and the normalising constant of the Curie-Weiss graph is easier to determine (Marsman, 2018).

We simplify the model here by letting all interactions between nodes have the same parameter θ1, and all thresholds the same parameter θ0. Let ν be the average number of neighbours in G. Then the effect of the neighbours on any of the nodes is

 θ0 + θ1 ∑j=1..ν xj

In the mean field model we consider the effect on a node as if all nodes were connected to each other, and we use the average effect of all other nodes on any one of them. So we obtain (Baxter, 2007)

 θ0 + θ1 (ν/(n−1)) sn

where sn = ∑j xj. Each sum sn = r can be obtained in (n choose r) ways, for r = 0, 1, ..., n. This leads to the probability of a configuration with sum r

 p(x;r) = (1/ZCW) (n choose r) exp(θ0r + (ν/(2(n−1))) θ1 r(r−1))

with normalising constant

 ZCW = ∑r=0..n (n choose r) exp(θ0r + (ν/(2(n−1))) θ1 r(r−1)) (11)

This version of a complete network as an approximation to one with on average ν neighbours is sometimes referred to as the Curie-Weiss model. The complexity of the normalising constant is O(n), and so much smaller than the O(2^n) required for a graph with an arbitrary edge distribution. Using the Curie-Weiss version thus makes the problem of determining the normalising constant in the cliques linear, and hence scalable.
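Equation (11) can be evaluated directly; the sketch below (our own naming) sums over the n + 1 values of r rather than 2^n configurations. Inside a clique all nodes are neighbours, so ν = n − 1 and the coefficient ν/(2(n − 1)) is 1/2.

```python
import math

# Curie-Weiss normalising constant (eq. 11): O(n) terms, one per number
# of active nodes r, instead of the 2^n configurations of the full sum.
def curie_weiss_Z(n, theta0, theta1, nu):
    # For n = 1 there are no interactions, so the quadratic term vanishes.
    coef = nu * theta1 / (2 * (n - 1)) if n > 1 else 0.0
    return sum(math.comb(n, r) * math.exp(theta0 * r + coef * r * (r - 1))
               for r in range(n + 1))
```

When the graph really is complete with equal parameters, this matches the brute-force constant exactly, since every configuration with r active nodes has the same probability.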

We cannot blindly apply the Curie-Weiss computation to the conditional distribution in the Ising model because we have to deal with the values of the variables in the conditioning set.

We continue with the example of the Ising model corresponding to Figure 3(a) and determine the Curie-Weiss version of the normalising constant. There is no external field, so θi = 0 for all nodes i. For each clique where we need the normalising constant, we take the average of the interaction parameters. We obtain two versions of the normalising constant for the clique {2, 3, 4}, depending on the value of x⋆2 being 0 or 1. Considering the normalising constant for the part (x3, x4), we see that the thresholds of x3 and x4 have changed to θ3 + θ23x⋆2 and θ4 + θ24x⋆2. In this example we had θ3 = θ4 = 0, and so a threshold appears by conditioning. Therefore, whenever x⋆2 is 0, the thresholds remain as if node 2 was not there, and if x⋆2 is 1, then the thresholds change to θ23 and θ24. In the Curie-Weiss version of the normalising constant we therefore get the average interaction parameter ¯θ1 = θ34 (the only interaction among the free nodes 3 and 4) and the average threshold parameter

 θ⋆0 = { (θ23+θ24)/2 if x⋆2=1; 0 otherwise

With these parameters we fill in equation (11) to obtain the normalising constant for the probability of the clique part (x3, x4) conditioned on x⋆2.

From this example we can determine the general rule to obtain the normalising constant for any clique in the factorisation of the conditional distribution. It is clear that for any clique C in the clique set of the Markov distribution and conditioning set A, the interaction parameters θij for the conditioned nodes j with x⋆j = 1 will be added to the threshold parameters, while otherwise the threshold parameters in the clique remain the same. Hence, for clique C and conditioning set A we have

 θ⋆0 = { ave(θi + ∑{j∈A∩C: x⋆j=1} θij, i∈C∩Ac) if there is j∈A∩C s.t. x⋆j=1; ave(θi, i∈C∩Ac) otherwise (12)

and ¯θ1 = ave(θij, i, j ∈ C∩Ac). This simple rule, where we change the threshold parameters and leave the interaction parameters, allows us to apply the Curie-Weiss normalisation constant to each clique in the factorisation of the distribution. In the case that all threshold parameters are equal and all interaction parameters are equal within a clique, this result is exact.
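The rule in (12) can be sketched as follows (function and argument names are ours; the clique is assumed fully connected among its free nodes, so ν = n − 1): fold the interactions with clamped-at-1 nodes into the thresholds, average, and apply (11).

```python
import math

# Sketch of rule (12) plus eq. (11) for one clique: interactions with
# nodes clamped to 1 are folded into the thresholds, the thresholds and
# remaining interactions are averaged, and the Curie-Weiss sum is applied.
def cw_conditional_Z(clique, theta_node, theta_edge, clamped):
    """clamped: {node: 0 or 1} for the conditioned nodes in the clique."""
    free = [i for i in clique if i not in clamped]
    thresholds = []
    for i in free:
        t = theta_node.get(i, 0.0)
        for (s, u), w in theta_edge.items():
            if (s == i and clamped.get(u) == 1) or (u == i and clamped.get(s) == 1):
                t += w  # fold interaction with a clamped-at-1 node into threshold
        thresholds.append(t)
    theta0 = sum(thresholds) / len(free)                 # averaged threshold
    inter = [w for (s, u), w in theta_edge.items() if s in free and u in free]
    theta1 = sum(inter) / len(inter) if inter else 0.0   # averaged interaction
    n = len(free)
    coef = theta1 / 2 if n > 1 else 0.0                  # nu = n - 1 in a clique
    return sum(math.comb(n, r) * math.exp(theta0 * r + coef * r * (r - 1))
               for r in range(n + 1))
```

With equal weights within the clique this reproduces the exact constant (Proposition 5.1 below); with unequal weights it is an approximation.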

• (Exact normalisation) Let G be a graph induced by the Ising model with cliques in the set of all cliques C, where within each clique the threshold parameters are equal and the interaction parameters are equal. Then for each clique the Curie-Weiss normalisation constant is identical to the exact normalisation constant, and hence the normalising constant of graph G is identical to the exact normalisation constant.

We consider the example from Figure 3(b), where we look at clique {2, 3, 4} (see Figure 3(a)) and condition on x⋆2. If for this clique we take all interaction parameters equal and, as before, all thresholds 0, then we obtain the equivalence according to Proposition 5.1

 pC3(xC3∣x⋆2) = p{3,4}(x{3,4}∣x⋆2) = (1/Z{3,4}∣2⋆) exp(θ23x⋆2x3+θ24x⋆2x4+θ34x3x4)

where Z{3,4}∣2⋆ is the Curie-Weiss normalisation constant obtained with (12) and (11). If x⋆2 = 1, then this Curie-Weiss constant equals

 1+exp(θ24)+exp(θ23)+exp(θ23+θ24+θ34)

with each θij equal. If instead we let the edge weights θ23, θ24, and θ34 differ, then the Curie-Weiss constant based on the averaged parameters no longer coincides with the exact value. We denote the approximation of the normalising constant for clique C using the averages from (12) by ¯ZC.

We see from this small example that, in general, when the threshold and interaction parameters differ, using the Curie-Weiss graph is an approximation. We should then ask under what circumstances the error between the exact and approximate versions is bounded, so that it may still be reasonable to use the approximation.
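To see the size of the error in a minimal case, the snippet below compares the exact conditional constant for the clique part (x3, x4) given x⋆2 = 1 with its Curie-Weiss version under rule (12). The weights are our own illustrative choices, deliberately unequal, and are not taken from the paper's simulations.

```python
import itertools
import math

# Exact vs Curie-Weiss conditional constant for free nodes (x3, x4) given
# x2* = 1, zero external field; deliberately unequal illustrative weights.
t23, t24, t34 = 0.2, 0.8, 0.5

# Exact: sum over (x3, x4) with thresholds t23, t24 and interaction t34.
Z_exact = sum(math.exp(t23 * x3 + t24 * x4 + t34 * x3 * x4)
              for x3, x4 in itertools.product([0, 1], repeat=2))

# Curie-Weiss: averaged threshold (rule 12), interaction t34, n = 2.
theta0 = (t23 + t24) / 2
Z_cw = sum(math.comb(2, r) * math.exp(theta0 * r + 0.5 * t34 * r * (r - 1))
           for r in range(3))

rel_err = abs(Z_cw - Z_exact) / Z_exact  # nonzero because t23 != t24
```

For these weights the relative error is below a few percent, which is the kind of behaviour Section 5.2 bounds formally.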

### 5.2 Bounding the error of Zcw

By assuming that the deviation of the parameters is small (concentration is high) we can guarantee that the error in the ratio of the exact and approximate normalisation constants (using the Curie-Weiss graph) is bounded. We will assume that the parameters are concentrated around the Curie-Weiss values in terms of sub-Gaussian variables whose scale parameter decreases with the size k of the clique. This is a strong assumption. For instance, when the distribution is normal, the standard deviation of the distribution of the parameters around θ0 and θ1 is divided by a power of k.

A sub-Gaussian random variable X with mean μ is one for which a σ > 0 exists such that E exp(t(X − μ)) ≤ exp(σ²t²/2) for any t ∈ ℝ. Taking the threshold and interaction parameters from sub-Gaussian distributions with parameters σ0 and σ1, respectively, together with the Hoeffding bound (see the Appendix and, e.g., Boucheron, Lugosi and Massart, 2013, or Venkatesh, 2013), gives the approximations

 ¯θ0 = θ0 + Op(k−3/2)