 # Dependence and Relevance: A probabilistic view

We examine three probabilistic concepts related to the sentence "two variables have no bearing on each other". We explore the relationships between these three concepts and establish their relevance to the process of constructing similarity networks---a tool for acquiring probabilistic knowledge from human experts. We also establish a precise relationship between connectedness in Bayesian networks and relevance in probability.


## 1 Introduction

The notion of relevance between pieces of information plays a key role in the theory of Bayesian networks and in the way they are used for inference. The intuition that guides the construction of Bayesian networks draws from the analogy between “connectedness” in graphical representations and “relevance” in the domain represented, that is, two nodes connected along some path correspond to variables of mutual relevance.

We examine three formal concepts related to the sentence “variables $x$ and $y$ have no bearing on each other”. First, two variables $x$ and $y$ are said to be mutually irrelevant if they are conditionally independent given any value of any subset of the other variables in the domain. Second, two variables are said to be uncoupled if the set of variables representing the domain can be partitioned into two independent sets, one containing $x$ and the other containing $y$. Finally, two variables $x$ and $y$ are unrelated if the corresponding nodes are disconnected in every minimal Bayesian network representation (to be defined).

The three concepts, mutual-irrelevance, uncoupledness, and unrelatedness are not identical. We show that uncoupledness and unrelatedness are always equivalent but sometimes differ from the notion of mutual-irrelevance. We identify a class of models called transitive for which all three concepts are equivalent. Strictly positive binary distributions (defined below) are examples of transitive models. We also show that “disconnectedness” in graphical representations and mutual-irrelevance in the domain represented coincide for every transitive model and for none other.

These results have theoretical and practical ramifications. Our analysis uses a qualitative abstraction of conditional independence known as graphoids [Pearl and Paz, 1989], and demonstrates the need for this abstraction in manipulating conditional independence assumptions (which are an integral part of every probabilistic reasoning engine). Our results also simplify the process of acquiring probabilistic knowledge from domain experts via similarity networks.

This article is organized as follows: A short overview of graphoids and their Bayesian network representation is provided in Section 2. (For more details consult Chapter 3 in Pearl, 1988.) Sections 3 and 4 investigate properties of mutual-irrelevance, uncoupledness, and unrelatedness and their relation to each other. Section 5 discusses two definitions of similarity networks. Section 6 shows that for a large class of probability distributions these definitions are equivalent.

## 2 Graphoids and Bayesian Networks

Since our definitions of mutual-irrelevance, uncoupledness, and unrelatedness all rely on the notion of conditional independence, it is useful to abstract probability distributions to reflect this fact. In particular, every probability distribution is viewed as a list of conditional independence statements with no reference to numerical parameters. This abstraction, called a graphoid, was proposed by Pearl and Paz [Pearl and Paz, 1989] and further discussed by Pearl [Pearl, 1988] and Geiger [Geiger, 1990].

Throughout the discussion we consider a finite set of variables $U = \{u_1, \ldots, u_n\}$, each of which is associated with a finite set of values, and a probability distribution $P$ having the Cartesian product of these value sets as its sample space.

Definition A probability distribution $P$ is defined over $U$ if its sample space is the Cartesian product of the value sets of the variables in $U$.

We use lowercase letters, possibly subscripted (e.g., $x$, $y$, or $u_i$), to denote variables, and use uppercase letters (e.g., $X$, $Y$, or $Z$) to denote sets of variables. A bold lowercase or uppercase letter refers to a value (instance) of a variable or of a set of variables, respectively. A value $\mathbf{X}$ of a set of variables $X$ is an element of the Cartesian product of the value sets of the variables in $X$. The notation $X = \mathbf{X}$ stands for $x_1 = \mathbf{x}_1, \ldots, x_k = \mathbf{x}_k$, where $X = \{x_1, \ldots, x_k\}$ and $\mathbf{x}_i$ is a value of $x_i$.

Definition The expression $I(X, Y \mid Z)$, where $X$, $Y$, and $Z$ are disjoint subsets of $U$, is called an independence statement, or independency. Its negation $\neg I(X, Y \mid Z)$ is called a dependence statement, or dependency. An independence or dependence statement is defined over $V \subseteq U$ if it mentions only elements of $V$.

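Definition Let $U$ be a finite set of variables with value sets and $P$ as above. An independence statement $I(X, Y \mid Z)$ is said to hold for $P$ if for every value $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$ of $X$, $Y$, and $Z$, respectively,

$$P(X=\mathbf{X} \mid Y=\mathbf{Y}, Z=\mathbf{Z}) = P(X=\mathbf{X} \mid Z=\mathbf{Z})$$

or $P(Y=\mathbf{Y}, Z=\mathbf{Z}) = 0$. Equivalently, $P$ is said to satisfy $I(X, Y \mid Z)$. Otherwise, $P$ is said to satisfy $\neg I(X, Y \mid Z)$.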
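The definition above can be checked mechanically on a small joint table. The following is a minimal sketch, assuming the distribution is stored as a Python dictionary from value tuples to probabilities and that variables are referred to by their position in the tuple; the names `marginal` and `independent` are illustrative and do not come from the article. It tests the equivalent product form $P(x,y,z)\,P(z) = P(x,z)\,P(y,z)$, which also covers the zero-probability case.

```python
def marginal(joint, idx):
    """Sum a joint table {value_tuple: prob} down to the positions in idx."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def independent(joint, X, Y, Z=(), tol=1e-9):
    """Check I(X, Y | Z); X, Y, Z are disjoint tuples of variable positions."""
    X, Y, Z = tuple(X), tuple(Y), tuple(Z)
    pxyz = marginal(joint, X + Y + Z)
    pxz, pyz, pz = marginal(joint, X + Z), marginal(joint, Y + Z), marginal(joint, Z)
    for xz, p_xz in pxz.items():
        for yz, p_yz in pyz.items():
            x, z = xz[:len(X)], xz[len(X):]
            y, z_other = yz[:len(Y)], yz[len(Y):]
            if z != z_other:
                continue
            # Product form of the definition: P(x,y,z) P(z) = P(x,z) P(y,z).
            if abs(pxyz.get(x + y + z, 0.0) * pz[z] - p_xz * p_yz) > tol:
                return False
    return True


# Two independent fair coins versus a coin and its exact copy.
independent_coins = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}
copied_coin = {(0, 0): 0.5, (1, 1): 0.5}
print(independent(independent_coins, (0,), (1,)))  # True
print(independent(copied_coin, (0,), (1,)))        # False
```

The tolerance `tol` is an implementation choice for floating-point tables and is not part of the definition.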

Definition When $I(X, Y \mid Z)$ holds for $P$, then $X$ and $Y$ are conditionally independent relative to $P$, and if $Z = \emptyset$, then $X$ and $Y$ are marginally independent relative to $P$.

Every probability distribution defines a dependency model.

Definition [Pearl, 1988] A dependency model $M$ over a finite set of elements $U$ is a set of triplets $(X, Y \mid Z)$, where $X$, $Y$, and $Z$ are disjoint subsets of $U$.

The definition of dependency models does not assume any structure on the elements of $U$. Namely, an element of $U$ could be, for example, a node in some graph or a name of a variable. In particular, if each $u_i \in U$ is associated with a finite set of values, then every probability distribution $P$ having the Cartesian product of these sets as its sample space defines a dependency model $M$ via the rule:

 (X,Y∣Z)∈M if and only if I(X,Y∣Z) holds in P, (1)

for all disjoint subsets $X$, $Y$, and $Z$ of $U$. Dependency models constructed using Equation 1 have some interesting properties that are summarized in the definition below.

Definition A graphoid $M$ over a finite set $U$ is any set of triplets $(X, Y \mid Z)$, where $X$, $Y$, and $Z$ are disjoint subsets of $U$, such that the following axioms are satisfied:

Trivial Independence

 (X,∅∣Z)∈M (2)

Symmetry

 (X,Y∣Z)∈M ⇒ (Y,X∣Z)∈M (3)

Decomposition

 (X,Y∪W∣Z)∈M ⇒ (X,Y∣Z)∈M (4)

Weak union

 (X,Y∪W∣Z)∈M ⇒ (X,Y∣Z∪W)∈M (5)

Contraction

 (X,Y∣Z)∈M & (X,W∣Z∪Y)∈M ⇒ (X,Y∪W∣Z)∈M (7)

The above relations are called the graphoid axioms. (Footnote 1: This definition differs slightly from that given in [Pearl and Paz, 1989], where axioms (3) through (7) define semi-graphoids. A variant of these axioms was first studied by Dawid [Dawid, 1979] and Spohn.)

Using the definition of conditional independence, it is easy to show that each probability distribution defines a graphoid via Equation 1 [Pearl, 1988]. The graphoid axioms have an appealing interpretation. For example, the weak union axiom states: If $Y$ and $W$ are conditionally independent of $X$, given a knowledge base $Z$, then $Y$ is conditionally independent of $X$ when $W$ is added to the knowledge base $Z$. In other words, the fact that a piece of information $W$, which is conditionally independent of $X$, becomes known does not change the status of $Y$; $Y$ remains conditionally independent of $X$ given the new irrelevant information $W$ [Pearl, 1988].

Graphoids are suited to represent the qualitative part of a task that requires a probabilistic analysis. For example, suppose an alarm system is installed in your house in order to detect burglaries; and suppose it can be activated by two separate sensors. Suppose also that, when the alarm sound is activated, there is a good chance that a police patrol will show up. We are interested in computing the probability of a burglary given a police car is near your house.

The dependencies in this story can be represented by a graphoid. We consider five binary variables, burglary, sensorA, sensorB, alarm, and patrol, each having the two values yes and no. We know that the outcomes of the two sensors are conditionally independent given burglary, and that alarm is conditionally independent of burglary given the outcomes of the sensors. We also know that patrol is conditionally independent of burglary given alarm (assuming that only the alarm prompts a police patrol). This qualitative information implies that the following three triplets must be included in a dependency model that describes the above story: (sensorA, sensorB ∣ burglary), (alarm, burglary ∣ {sensorA, sensorB}), and (patrol, {burglary, sensorA, sensorB} ∣ alarm).
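To make the three triplets concrete, the sketch below builds one joint distribution that is compatible with the burglary story and confirms numerically that each triplet holds. All probabilities are hypothetical, since the article gives only the qualitative structure, and the helper names `marginal`, `independent`, and `bern` are illustrative.

```python
from itertools import product


def marginal(joint, idx):
    """Sum a joint table {value_tuple: prob} down to the positions in idx."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def independent(joint, X, Y, Z=(), tol=1e-9):
    """Check I(X, Y | Z) via P(x,y,z) P(z) = P(x,z) P(y,z)."""
    X, Y, Z = tuple(X), tuple(Y), tuple(Z)
    pxyz = marginal(joint, X + Y + Z)
    pxz, pyz, pz = marginal(joint, X + Z), marginal(joint, Y + Z), marginal(joint, Z)
    for xz, p_xz in pxz.items():
        for yz, p_yz in pyz.items():
            if xz[len(X):] != yz[len(Y):]:
                continue
            key = xz[:len(X)] + yz[:len(Y)] + xz[len(X):]
            if abs(pxyz.get(key, 0.0) * pz[xz[len(X):]] - p_xz * p_yz) > tol:
                return False
    return True


def bern(p, value):
    """Probability that a binary variable with P(1) = p takes the given value."""
    return p if value == 1 else 1.0 - p


# Positions of the five variables in each value tuple (1 = yes, 0 = no).
B, SA, SB, AL, PT = range(5)

# Hypothetical numbers, chosen only for illustration.
P_B = 0.01
P_SA = {1: 0.90, 0: 0.05}            # P(sensorA = yes | burglary)
P_SB = {1: 0.80, 0: 0.10}            # P(sensorB = yes | burglary)
P_AL = {(1, 1): 0.99, (1, 0): 0.90, (0, 1): 0.90, (0, 0): 0.01}
P_PT = {1: 0.70, 0: 0.02}            # P(patrol = yes | alarm)

joint = {}
for b, sa, sb, al, pt in product((0, 1), repeat=5):
    joint[(b, sa, sb, al, pt)] = (bern(P_B, b) * bern(P_SA[b], sa) * bern(P_SB[b], sb)
                                  * bern(P_AL[(sa, sb)], al) * bern(P_PT[al], pt))

triplet_1 = independent(joint, (SA,), (SB,), (B,))
triplet_2 = independent(joint, (AL,), (B,), (SA, SB))
triplet_3 = independent(joint, (PT,), (B, SA, SB), (AL,))
sensors_marginally_independent = independent(joint, (SA,), (SB,))
print(triplet_1, triplet_2, triplet_3, sensors_marginally_independent)
```

With these numbers the two sensors are dependent marginally and independent only once burglary is known, which is why the conditioning set in the first triplet matters.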

The explicit representation of all triplets of a dependency model is often impractical, because there are an exponential number of possible triplets. Consequently, an implicit representation is needed. We will next describe such a representation.

Definition [Pearl, 1988] Let $M$ be a graphoid over $U$. A directed acyclic graph $D$ is a Bayesian network of $M$ if $D$ is constructed from $M$ by the following steps: assign a construction order $u_1, \ldots, u_n$ to the elements in $U$, and designate a node for each $u_i$. For each $u_i$ in $U$, identify a set $\pi(u_i) \subseteq \{u_1, \ldots, u_{i-1}\}$ such that

 ({ui},{u1,…,ui−1}∖π(ui)∣π(ui))∈M. (8)

Assign a link from every element in $\pi(u_i)$ to $u_i$. The resulting network is minimal if, for each $u_i$, no proper subset of $\pi(u_i)$ satisfies Equation (8).
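The construction can be sketched as a search over candidate parent sets. The code below is an illustrative brute-force version, assuming the graphoid is induced by a joint table and queried through a numerical independence test; the function names and the three-variable chain used as input are assumptions made for the example. Candidate sets are tried in order of increasing size, so the first set that satisfies Equation (8) has no proper subset that also satisfies it.

```python
from itertools import combinations, product


def marginal(joint, idx):
    """Sum a joint table {value_tuple: prob} down to the positions in idx."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def independent(joint, X, Y, Z=(), tol=1e-9):
    """Check I(X, Y | Z) via P(x,y,z) P(z) = P(x,z) P(y,z)."""
    X, Y, Z = tuple(X), tuple(Y), tuple(Z)
    pxyz = marginal(joint, X + Y + Z)
    pxz, pyz, pz = marginal(joint, X + Z), marginal(joint, Y + Z), marginal(joint, Z)
    for xz, p_xz in pxz.items():
        for yz, p_yz in pyz.items():
            if xz[len(X):] != yz[len(Y):]:
                continue
            key = xz[:len(X)] + yz[:len(Y)] + xz[len(X):]
            if abs(pxyz.get(key, 0.0) * pz[xz[len(X):]] - p_xz * p_yz) > tol:
                return False
    return True


def minimal_network(joint, order):
    """Return {node: parents} built along the given construction order."""
    parents = {}
    for i, u in enumerate(order):
        preds = order[:i]
        for size in range(len(preds) + 1):
            for pi in combinations(preds, size):
                rest = tuple(v for v in preds if v not in pi)
                # Equation (8): ({u}, predecessors minus pi | pi) must hold.
                if independent(joint, (u,), rest, pi):
                    parents[u] = pi
                    break
            if u in parents:
                break
    return parents


# Hypothetical chain 0 -> 1 -> 2: each variable is a noisy copy of the previous one.
joint = {}
for a, b, c in product((0, 1), repeat=3):
    joint[(a, b, c)] = 0.5 * (0.9 if b == a else 0.1) * (0.8 if c == b else 0.2)

network = minimal_network(joint, (0, 1, 2))
print(network)  # {0: (), 1: (0,), 2: (1,)}
```

The search is exponential in the number of predecessors and is meant only to mirror the definition on toy inputs.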

A Bayesian network of the burglary story is shown in Figure 1. We shall see that aside from the triplets that were used to construct the network, the triplet (patrol, burglary ∣ {sensorA, sensorB}) follows from the topology of the network. In the remainder of this section, we describe a general methodology to determine which triplets of a graphoid $M$ are represented in a Bayesian network of $M$. The criterion of d-separation, defined below, provides the answer. Some preliminary definitions are needed.

Definition The underlying graph of a Bayesian network is an undirected graph obtained from the Bayesian network by replacing every link with an undirected edge.

Definition A trail in a Bayesian network is a sequence of links that form a simple (cycle-free) path in the underlying graph. Two nodes are connected in a Bayesian network if there exists a trail connecting them. Otherwise they are disconnected. If $a \to b$ is a link in a Bayesian network, then $a$ is a parent of $b$ and $b$ is a child of $a$. If there is a directed path of length greater than zero from $a$ to $b$, then $a$ is an ancestor of $b$ and $b$ is a descendant of $a$.

Definition (Pearl, 1988) A node $b$ is called a head-to-head node wrt (with respect to) a trail $t$ if there are two consecutive links $a \to b$ and $b \leftarrow c$ on $t$.

For example, sensorA → alarm ← sensorB is a trail in Figure 1 and alarm is a head-to-head node with respect to this trail.

Definition (Pearl, 1988) A trail $t$ is active wrt a set of nodes $Z$ if (1) every head-to-head node wrt $t$ either is in $Z$ or has a descendant in $Z$ and (2) every other node along $t$ is outside $Z$. Otherwise, the trail is said to be blocked (or d-separated) by $Z$.

In Figure 1, for example, both trails between sensorA and sensorB are d-separated by $Z$ = {burglary}; the trail sensorA ← burglary → sensorB is d-separated by $Z$ because node burglary, which is not a head-to-head node wrt this trail, is in $Z$. The trail sensorA → alarm ← sensorB is d-separated by $Z$, because node alarm and its descendant patrol are outside $Z$. In contrast, sensorA → alarm ← sensorB is not d-separated by {burglary, patrol} because patrol is in this set.
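The definitions of trail, head-to-head node, and active trail translate directly into a small checker. The sketch below enumerates every simple trail between two nodes and tests it against a set $Z$; it assumes that the two endpoints are outside $Z$, and it assumes the edge list shown for the burglary network, which is inferred from the story because Figure 1 is not reproduced here. The function names are illustrative.

```python
def descendants(children, node):
    """All nodes reachable from node by a directed path of length > 0."""
    seen, stack = set(), [node]
    while stack:
        for child in children.get(stack.pop(), ()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def trails(edges, x, y):
    """Yield every simple path from x to y in the underlying undirected graph."""
    nbrs = {}
    for a, b in edges:
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)

    def extend(path):
        if path[-1] == y:
            yield path
            return
        for n in nbrs.get(path[-1], ()):
            if n not in path:
                yield from extend(path + [n])

    yield from extend([x])


def is_active(edges, trail, Z):
    """Apply conditions (1) and (2) of the definition to the interior nodes."""
    edge_set = set(edges)
    children = {}
    for a, b in edges:
        children.setdefault(a, []).append(b)
    for prev, node, nxt in zip(trail, trail[1:], trail[2:]):
        if (prev, node) in edge_set and (nxt, node) in edge_set:  # head-to-head
            if node not in Z and not (descendants(children, node) & Z):
                return False
        elif node in Z:
            return False
    return True


def d_separated(edges, x, y, Z):
    """True if every trail between x and y is blocked by Z."""
    Z = set(Z)
    return not any(is_active(edges, t, Z) for t in trails(edges, x, y))


# Assumed structure of the burglary network (parent, child).
edges = [("burglary", "sensorA"), ("burglary", "sensorB"),
         ("sensorA", "alarm"), ("sensorB", "alarm"), ("alarm", "patrol")]

print(d_separated(edges, "sensorA", "sensorB", {"burglary"}))            # True
print(d_separated(edges, "sensorA", "sensorB", {"burglary", "patrol"}))  # False
print(d_separated(edges, "patrol", "burglary", {"sensorA", "sensorB"}))  # True
```

Enumerating trails is exponential in general; it is used here because it follows the wording of the definition, not because it is an efficient test.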

The theorem below is the major building block for most of the developments presented in this article.

###### Theorem 1

[Verma and Pearl, 1988] Let $D$ be a Bayesian network of a graphoid $M$ over $U$; and let $X$, $Y$, and $Z$ be three disjoint subsets of $U$. If all trails between a node in $X$ and a node in $Y$ are d-separated by $Z$, then $(X, Y \mid Z) \in M$.

For example, in the Bayesian network of Figure 1, all trails between patrol and burglary are d-separated by {sensorA, sensorB}. Thus, Theorem 1 guarantees that (patrol, burglary ∣ {sensorA, sensorB}) ∈ $M$, where $M$ is the graphoid from which this network was constructed. (Footnote 2: Within the expression of independence statements, we often write a single element instead of the singleton set that contains it.) Furthermore, Geiger and Pearl show that no other graphical criterion reveals more triplets of $M$ than does d-separation.

Geiger et al. generalize Theorem 1 to networks that include deterministic nodes (i.e., nodes whose value is a function of their parents’ values). Shachter obtains related results. Lauritzen et al. establish another graphical criterion and show that it is equivalent to d-separation.

## 3 Three Notions of Relevance

We can now define mutual-irrelevance, uncoupledness, and unrelatedness, and study their properties.

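Definition Let $M$ be a graphoid over $U$, and let $x, y \in U$.

• $x$ and $y$ are uncoupled if there exists a partition $U_1, U_2$ of $U$ such that $x \in U_1$, $y \in U_2$, and $(U_1, U_2 \mid \emptyset) \in M$. Otherwise, $x$ and $y$ are coupled, denoted coupled(x,y).

• $x$ and $y$ are unrelated if $x$ and $y$ are disconnected in every minimal Bayesian network of $M$. Otherwise, $x$ and $y$ are related, denoted related(x,y).

• $x$ and $y$ are mutually irrelevant if $(x, y \mid Z) \in M$ for every $Z \subseteq U \setminus \{x, y\}$. Otherwise, $x$ and $y$ are mutually relevant, denoted relevant(x,y).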

We could have defined these concepts using conditional independence. By choosing the graphoid framework, however, we gain in two respects. First, we emphasize that all the properties that we discuss are proved using the graphoid axioms alone; we do not use any properties of probability theory that are not summarized in these axioms. Second, our results are more general in that they apply to any graphoid, not necessarily one defined by conditional independence in a probability distribution. Examples of other types of graphoids are given in [Pearl, 1988].

Later in this section, we show that if two nodes $x$ and $y$ are disconnected in one minimal Bayesian network of $M$, then $x$ and $y$ are disconnected in every minimal network of $M$. Thus, to check whether $x$ and $y$ are unrelated, it suffices to examine whether or not they are connected in one minimal network representation rather than examine all possible minimal networks. This observation, demonstrated by Theorem 6, offers a considerable reduction in complexity. Based on the development that leads to Theorem 6, we also prove that $x$ and $y$ are unrelated if and only if they are uncoupled.

Definition A connected component $C$ of a Bayesian network $D$ is a subgraph of $D$ in which every two nodes are connected (by a trail). A connected component $C$ is maximal if there exists no proper super-graph of $C$ that is a connected component of $D$.
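Since connectedness ignores link directions, the maximal connected components of a network can be found with an ordinary graph traversal of the underlying graph. The sketch below is one way to do this; the edge list for the burglary network is assumed from the story, and the isolated node `coin` is a hypothetical addition used only to produce a second component.

```python
def maximal_components(nodes, edges):
    """Return the node sets of the maximal connected components."""
    nbrs = {n: set() for n in nodes}
    for a, b in edges:            # direction is dropped: underlying graph
        nbrs[a].add(b)
        nbrs[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = {start}, [start]
        while stack:
            for n in nbrs[stack.pop()]:
                if n not in component:
                    component.add(n)
                    stack.append(n)
        seen |= component
        components.append(frozenset(component))
    return components


def disconnected(nodes, edges, x, y):
    """True if no trail connects x and y."""
    return not any(x in c and y in c for c in maximal_components(nodes, edges))


nodes = ["burglary", "sensorA", "sensorB", "alarm", "patrol", "coin"]
edges = [("burglary", "sensorA"), ("burglary", "sensorB"),
         ("sensorA", "alarm"), ("sensorB", "alarm"), ("alarm", "patrol")]

components = maximal_components(nodes, edges)
print(len(components))                                   # 2
print(disconnected(nodes, edges, "burglary", "coin"))    # True
print(disconnected(nodes, edges, "burglary", "patrol"))  # False
```

If the network passed in is minimal, this single traversal is what the text above means by examining connectedness in one minimal network representation.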

###### Lemma 2

Let $D$ be a Bayesian network of a graphoid $M$ over $U$, and let $X$ and $Y$ be subsets of $U$. If all nodes in $X$ are disconnected from all nodes in $Y$, then $(X, Y \mid \emptyset) \in M$.

Proof: There is no active trail between a node in $X$ and a node in $Y$. Thus, by Theorem 1, $(X, Y \mid \emptyset) \in M$.

###### Lemma 3

Let $D$ be a Bayesian network of a graphoid $M$ over $U$, $v$ be in $U$, $\pi(v)$ be the set of $v$’s parents, and $W$ be the set of all nodes that are not descendants of $v$ except $v$’s parents. Then, $(v, W \mid \pi(v)) \in M$.

Proof: The set $\pi(v)$ d-separates all trails between a node in $W$ and $v$, because each such trail either passes through a parent of $v$ and therefore is blocked by $\pi(v)$, or reaches $v$ through one of $v$’s children and thus must have a head-to-head node $h$, where neither $h$ nor its descendants are in $\pi(v)$. Thus, by Theorem 1, $(v, W \mid \pi(v)) \in M$.

###### Lemma 4

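Let $D$ be a minimal Bayesian network of a graphoid $M$ over $U$, $v$ be in $U$, and $A \cup B$ be $v$’s parents, with $A$ and $B$ disjoint. Then $(v, B \mid A) \notin M$, unless $B = \emptyset$.

Proof: Since $A \cup B$ are the parents of $v$, by Lemma 3, $(v, W \mid A \cup B) \in M$, where $W$ is the set of $v$’s non-descendants except its parents. Assume, by contradiction, that $(v, B \mid A) \in M$ and $B \neq \emptyset$. The two triplets imply by the contraction axiom that $(v, B \cup W \mid A) \in M$. Since $B \neq \emptyset$, $A$ is a proper subset of $A \cup B$, where $A \cup B$ are the parents of $v$ in $D$. Hence $D$ is not minimal, because Equation 8 is satisfied by a proper subset of $v$’s parents, which is a contradiction.

Definition A set $\{X_1, X_2\}$ is a partition of $X$ iff $X_1 \cup X_2 = X$, $X_1 \cap X_2 = \emptyset$, $X_1 \neq \emptyset$, and $X_2 \neq \emptyset$.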

###### Lemma 5

Let $M$ be a graphoid over $U$, $D$ be a minimal Bayesian network of $M$, and $C$ be a connected component of $D$ with a set of nodes $X$. Then, there exists no partition $X_1, X_2$ of $X$ such that $(X_1, X_2 \mid \emptyset) \in M$.

Proof: Suppose $X_1, X_2$ is a partition of $X$ such that $(X_1, X_2 \mid \emptyset) \in M$. Since $X_1$ and $X_2$ are connected in $D$, there must exist a link between a node in $X_1$ and a node in $X_2$. Without loss of generality, assume it is directed from a node in $X_2$ to a node $v$ in $X_1$.

Let $A$, $B$ be the parents of $v$ in $X_1$ and $X_2$, respectively. The triplet $(X_1, X_2 \mid \emptyset)$, which we assumed to be in $M$, implies, using symmetry and decomposition, that $(\{v\} \cup A, B \mid \emptyset)$ is in $M$. By symmetry and weak union, $(v, B \mid A)$ is in $M$ as well. Thus, by Lemma 4, the network is not minimal, unless $B = \emptyset$. We have assumed, however, that $v$ has a parent in $X_2$. Hence, $B \neq \emptyset$. Therefore, $D$ is not minimal, contrary to our assumption.

###### Theorem 6

Let $M$ be a graphoid over $U$. If two elements of $U$ are disconnected in some minimal Bayesian network of $M$, then they are disconnected in every minimal Bayesian network of $M$.

Proof: It suffices to show that any two minimal Bayesian networks of $M$ share the same maximal connected components. Let $D_A$ and $D_B$ be two minimal Bayesian networks of $M$. Let $C_A$ and $C_B$ be maximal connected components of $D_A$ and $D_B$, respectively. Let $A$ and $B$ be the nodes of $C_A$ and $C_B$, respectively. We show that either $A = B$ or $A \cap B = \emptyset$. This demonstration will complete the proof, because for an arbitrary maximal connected component $C_A$ in $D_A$ there must exist a maximal connected component in $D_B$ that shares at least one node with $C_A$. Thus, by the above claim, it must have exactly the same nodes as $C_A$. Therefore, each maximal connected component of $D_A$ shares the same nodes with exactly one maximal connected component of $D_B$. Hence, $D_A$ and $D_B$ share the same maximal connected components.

Since $D_B$ is a minimal Bayesian network of $M$ and $C_B$ is a maximal connected component of $D_B$, by Lemma 2, $(B, U \setminus B \mid \emptyset) \in M$. Using symmetry and decomposition, $(A \cap B, A \setminus B \mid \emptyset) \in M$. Thus, by Lemma 5, for $C_A$ to be a maximal connected component, either $A \cap B$ or $A \setminus B$ must be empty, lest $D_A$ not be minimal. Similarly, for $C_B$ to be a maximal connected component, $A \cap B$ or $B \setminus A$ must be empty. Thus, either $A = B$ or $A \cap B = \emptyset$.

###### Theorem 7

Two variables $x$ and $y$ of a graphoid $M$ over $U$ are unrelated iff they are uncoupled.

Proof: If $x$ and $y$ are unrelated, then let $U_1$ be the variables connected to $x$ in some minimal network of $M$ (together with $x$ itself), and $U_2$ be the rest of the variables in $U$. By Lemma 2, $(U_1, U_2 \mid \emptyset) \in M$. Thus, $x$ and $y$ are uncoupled.

If $x$ and $y$ are uncoupled, then there exists a partition $U_1, U_2$ of $U$ such that $x \in U_1$, $y \in U_2$, and $(U_1, U_2 \mid \emptyset) \in M$. We show that in every minimal Bayesian network of $M$, nodes $x$ and $y$ do not reside in the same connected component. Thus, $x$ and $y$ are unrelated. Assume, to the contrary, that $x$ and $y$ reside in the same maximal connected component of a minimal Bayesian network $D$ of $M$, and that $X$ are the nodes in that component. The statement $(U_1 \cap X, U_2 \cap X \mid \emptyset) \in M$ follows from $(U_1, U_2 \mid \emptyset) \in M$ by the symmetry and decomposition axioms. Moreover, $U_1 \cap X$ and $U_2 \cap X$ are not empty, because they include $x$ and $y$, respectively. Since $U_1$ and $U_2$ are disjoint, the two sets $U_1 \cap X$, $U_2 \cap X$ partition $X$. Therefore, by Lemma 5, $D$ cannot be minimal, contrary to our assumption.

Theorem 7 shows that $x$ and $y$ are related if and only if they are coupled.

## 4 Transitive Graphoids

In this section, we show that if $x$ and $y$ are mutually relevant, then $x$ and $y$ are coupled. Then, we identify conditions under which the converse holds, and provide an example in which these conditions are not met.

###### Theorem 8

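Let $M$ be a graphoid over $U$. Then, for every $x, y \in U$,

 relevant(x,y)⇒coupled(x,y).

Proof: Suppose $x$ and $y$ are not coupled. Let $U_1$, $U_2$ be a partition of $U$ such that $x \in U_1$, $y \in U_2$, and $(U_1, U_2 \mid \emptyset) \in M$. We show that $x$ and $y$ must be mutually irrelevant. Let $Z$ be an arbitrary subset of $U \setminus \{x, y\}$. Let $Z_1 = Z \cap U_1$ and $Z_2 = Z \cap U_2$. The statement $(U_1, U_2 \mid \emptyset) \in M$ implies, by the decomposition and symmetry axioms, that $(\{x\} \cup Z_1, \{y\} \cup Z_2 \mid \emptyset) \in M$. By symmetry and weak union, $(x, y \mid Z) \in M$. Thus, $(x, y \mid Z) \in M$ for every $Z \subseteq U \setminus \{x, y\}$. Hence, $x$ and $y$ are mutually irrelevant.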

The converse of Theorem 8 does not hold in general; if $x$ and $y$ are mutually irrelevant, it does not follow that $x$ and $y$ are uncoupled. For example, assume $M$ is a graphoid over $\{x, y, z\}$ that consists of $(x, y \mid \emptyset)$, $(x, y \mid z)$, and the statements implied from them by the graphoid axioms. Then, $x$ and $y$ are mutually irrelevant, yet $x$ and $y$ are coupled because $(x, \{y, z\} \mid \emptyset) \notin M$ and $(\{x, z\}, y \mid \emptyset) \notin M$.

To see that there is a probability distribution that induces this graphoid, suppose $x$ and $y$ are the outcomes of two independent fair coins. In addition, suppose that $z$ is a variable whose domain is the set of ordered pairs of coin outcomes and whose value is $(i, j)$ if and only if the outcome of $x$ is $i$ and the outcome of $y$ is $j$. Then $x$ and $y$ are mutually irrelevant, because $x$ and $y$ are marginally independent and independent given $z$. Nevertheless, they are coupled, because neither $I(x, \{y, z\} \mid \emptyset)$ nor $I(\{x, z\}, y \mid \emptyset)$ holds for $P$.
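The two-coin example is small enough to verify by brute force. The sketch below encodes $z$ as the pair of coin outcomes and checks both definitions by enumerating every conditioning set and every partition; the function names and the 0/1 encoding of coin outcomes are assumptions made for the illustration.

```python
from itertools import combinations


def marginal(joint, idx):
    """Sum a joint table {value_tuple: prob} down to the positions in idx."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def independent(joint, X, Y, Z=(), tol=1e-9):
    """Check I(X, Y | Z) via P(x,y,z) P(z) = P(x,z) P(y,z)."""
    X, Y, Z = tuple(X), tuple(Y), tuple(Z)
    pxyz = marginal(joint, X + Y + Z)
    pxz, pyz, pz = marginal(joint, X + Z), marginal(joint, Y + Z), marginal(joint, Z)
    for xz, p_xz in pxz.items():
        for yz, p_yz in pyz.items():
            if xz[len(X):] != yz[len(Y):]:
                continue
            key = xz[:len(X)] + yz[:len(Y)] + xz[len(X):]
            if abs(pxyz.get(key, 0.0) * pz[xz[len(X):]] - p_xz * p_yz) > tol:
                return False
    return True


def subsets(items):
    """Yield every subset of items as a tuple."""
    for size in range(len(items) + 1):
        yield from combinations(items, size)


def mutually_irrelevant(joint, x, y, n):
    """I(x, y | Z) must hold for every Z drawn from the other n - 2 variables."""
    others = [v for v in range(n) if v not in (x, y)]
    return all(independent(joint, (x,), (y,), Z) for Z in subsets(others))


def uncoupled(joint, x, y, n):
    """Some partition U1, U2 with x in U1 and y in U2 must be marginally independent."""
    others = [v for v in range(n) if v not in (x, y)]
    for side in subsets(others):
        U1 = (x,) + side
        U2 = (y,) + tuple(v for v in others if v not in side)
        if independent(joint, U1, U2):
            return True
    return False


# Positions: 0 = x, 1 = y, 2 = z, where z records the pair (x, y).
joint = {(i, j, (i, j)): 0.25 for i in (0, 1) for j in (0, 1)}

print(mutually_irrelevant(joint, 0, 1, 3))  # True
print(uncoupled(joint, 0, 1, 3))            # False
```

Both loops are exponential in the number of variables, which is acceptable for a three-variable counterexample and nothing more.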

A necessary and sufficient condition for the converse of Theorem 8 to hold, as we shall see, is that the graphoid is transitive.

Definition A graphoid $M$ over $U$ is transitive if for every $x, y, z \in U$,

 relevant(x,y) & relevant(y,z) ⇒ relevant(x,z). (9)

First, we show that transitivity is necessary.

###### Theorem 9

Let $M$ be a graphoid over $U$ such that for every $x, y \in U$, coupled(x,y) implies relevant(x,y). Then $M$ is a transitive graphoid.

Proof: By Theorem 8 and the assumption, relevant(x,y) if and only if coupled(x,y). Also, by Theorem 7, coupled(x,y) if and only if related(x,y). Since related is a transitive relation, so is relevant. Thus, $M$ is transitive.

Some preliminaries are needed before we show that transitivity is a sufficient condition as well.

Definition Let $M$ be a graphoid over $U$, and $A$, $B$ be two disjoint subsets of $U$. Then $A$ and $B$ are mutually irrelevant if $(A, B \mid Z) \in M$ for every $Z$ that is a subset of $U \setminus (A \cup B)$.

###### Lemma 10

Let $M$ be a graphoid over $U$, and $A$, $B$, and $C$ be three disjoint subsets of $U$. If $A$ and $B$ are mutually irrelevant, and $A$ and $C$ are mutually irrelevant, then $A$ and $B \cup C$ are mutually irrelevant as well.

Proof: Let $Z$ be an arbitrary subset of $U \setminus (A \cup B \cup C)$. By definition, the mutual irrelevance of $A$ and $B$ implies $(A, B \mid Z) \in M$, and the mutual irrelevance of $A$ and $C$ implies $(A, C \mid Z \cup B) \in M$. Together, these statements imply by the contraction axiom that $(A, B \cup C \mid Z) \in M$. Since $Z$ is arbitrary, $A$ and $B \cup C$ are mutually irrelevant.

As is well known from probability theory, if $x$ and $y$ are independent, and $x$ and $z$ are independent, then, contrary to our intuition, $x$ is not necessarily independent of $\{y, z\}$. Lemma 10, on the other hand, shows that if $x$ and $y$ are mutually irrelevant, and $x$ and $z$ are mutually irrelevant, then $x$ and $\{y, z\}$ must also be mutually irrelevant.
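A standard illustration of the first remark, not taken from the article, is the exclusive-or of two fair coins: $x = y \oplus z$ is marginally independent of $y$ and of $z$, yet it is fully determined by the pair. The sketch below checks this numerically and also shows why Lemma 10 is not contradicted: $x$ and $y$ are not mutually irrelevant here, because they become dependent once $z$ is known. Helper names are illustrative.

```python
def marginal(joint, idx):
    """Sum a joint table {value_tuple: prob} down to the positions in idx."""
    out = {}
    for values, p in joint.items():
        key = tuple(values[i] for i in idx)
        out[key] = out.get(key, 0.0) + p
    return out


def independent(joint, X, Y, Z=(), tol=1e-9):
    """Check I(X, Y | Z) via P(x,y,z) P(z) = P(x,z) P(y,z)."""
    X, Y, Z = tuple(X), tuple(Y), tuple(Z)
    pxyz = marginal(joint, X + Y + Z)
    pxz, pyz, pz = marginal(joint, X + Z), marginal(joint, Y + Z), marginal(joint, Z)
    for xz, p_xz in pxz.items():
        for yz, p_yz in pyz.items():
            if xz[len(X):] != yz[len(Y):]:
                continue
            key = xz[:len(X)] + yz[:len(Y)] + xz[len(X):]
            if abs(pxyz.get(key, 0.0) * pz[xz[len(X):]] - p_xz * p_yz) > tol:
                return False
    return True


# Positions: 0 = x, 1 = y, 2 = z, with y and z fair coins and x = y XOR z.
joint = {(y ^ z, y, z): 0.25 for y in (0, 1) for z in (0, 1)}

x_indep_y = independent(joint, (0,), (1,))            # marginal independence
x_indep_z = independent(joint, (0,), (2,))            # marginal independence
x_indep_yz = independent(joint, (0,), (1, 2))         # fails: x is a function of (y, z)
x_indep_y_given_z = independent(joint, (0,), (1,), (2,))  # fails as well
print(x_indep_y, x_indep_z, x_indep_yz, x_indep_y_given_z)
# True True False False
```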

###### Theorem 11

If $M$ is a transitive graphoid over $U$, then for every $x, y \in U$,

 coupled(x,y)⇒relevant(x,y).

Proof: Let $M$ be a transitive graphoid over $U$, and $x$, $y$ be two arbitrary elements in $U$ such that $x$ and $y$ are mutually irrelevant. We will show by induction on $|U|$ that if relevant is transitive, then there exists a minimal Bayesian network $D$ of $M$ where $x$ and $y$ are disconnected. Consequently, $x$ and $y$ are uncoupled (Theorem 7).

We construct $D$ in the ordering $u_1 = x, u_2 = y, u_3, \ldots, u_n = e$ of $U$. Assume $n = 2$. Variables $x$ and $y$ are mutually irrelevant. Thus, $(x, y \mid \emptyset) \in M$. Hence, $x$ and $y$ are not connected. Otherwise, $n > 2$.

Let $M_e$ be a dependency model over $U \setminus \{e\}$ formed from $M$ by removing all triplets involving $e$. The model $M_e$ is a graphoid, because whenever the left hand side of one of the graphoid axioms does not mention $e$, then neither does the right hand side. Let $D_e$ be a minimal Bayesian network of $M_e$ formed from $M_e$ by the construction order $u_1, \ldots, u_{n-1}$. Let $A$ be the set of nodes connected to $x$, let $B$ be the set of nodes connected to $y$, and let $C$ be the rest of the nodes in $D_e$. The Bayesian network $D$ of $M$ is formed from $D_e$ by adding the last node $e$ as a sink and letting its parents be a minimal set that makes $e$ independent of all the rest of the variables in $U$ (following the definition of minimal Bayesian networks).

Since $x$ and $y$ are mutually irrelevant in $M$, it follows that they are also mutually irrelevant in $M_e$. Thus, by the induction hypothesis, $x$ and $y$ are disconnected in $D_e$. After node $e$ is added, a trail through $e$ might exist in $D$ that connects a node in $A$ and a node in $B$. We will show that there is none; if the parent set of $e$ is minimal, then either $e$ has no parents in $A$ or it has no parents in $B$, rendering $x$ and $y$ disconnected in $D$.

Since $x$ and $y$ are mutually irrelevant, it follows that either $x$ and $e$ are mutually irrelevant or $y$ and $e$ are mutually irrelevant, lest $M$ not be transitive. Without loss of generality, assume that $x$ and $e$ are mutually irrelevant. Let $a$ be an arbitrary node in $A$. By transitivity it follows that either $x$ and $a$ are mutually irrelevant or $a$ and $e$ are mutually irrelevant, lest $x$ and $e$ not be mutually irrelevant, contrary to our selection of $x$. If $x$ and $a$ are mutually irrelevant, then by the induction hypothesis, $A$ can be partitioned into two marginally independent subsets. Thus, by Lemma 5, $A$ would not be connected in the Bayesian network $D_e$, contradicting our selection of $A$. Thus, every element $a \in A$ and $e$ are mutually irrelevant. It follows that the entire set $A$ and $e$ are mutually irrelevant (Lemma 10). Thus, in particular, $(A, e \mid \pi_B \cup \pi_C) \in M$, where $\pi_B$ are the parents of $e$ in $B$, and $\pi_C$ are the parents of $e$ in $C$. Assume $\pi_A$ is the set of parents of $e$ in $A$. By decomposition, $(A, e \mid \pi_B \cup \pi_C) \in M$ implies $(\pi_A, e \mid \pi_B \cup \pi_C) \in M$. By Lemma 4, $D$ is not minimal, unless $\pi_A$ is empty.

Theorems 8, 9 and 11 show that the relations coupled and relevant are identical for every transitive graphoid and for none other. We emphasize that these results apply also to every probability distribution that defines a transitive graphoid. In Section 6, we show that many probability distributions indeed define transitive graphoids. First, however, we pause to demonstrate the relationship of these results to knowledge acquisition and knowledge representation.

## 5 Similarity Networks

Similarity networks were invented by Heckerman as a tool for constructing large Bayesian networks from domain experts’ judgments. Heckerman used them to construct a large diagnosis system for lymph-node pathology. The main advantage of similarity networks is their ability to utilize statements of conditional independence that are not represented in a Bayesian network, in order to reduce more drastically the number of parameters a domain expert needs to specify. Furthermore, the construction of a large Bayesian network is divided into several stages, each of which involves the construction of a small local Bayesian network. This divide-and-conquer approach helps to elicit reliable expert judgments. At the diagnosis stage, the local networks are combined into one global Bayesian network that represents the entire domain.

In [Geiger and Heckerman, 1993], we show how to use the local networks directly for inference without converting them to a global Bayesian network, and remove several technical restrictions imposed by the original development. Also, we develop two simple definitions of similarity networks which we present here informally. In the next section, we show that although the two definitions are conceptually distinct they often coincide.

A Bayesian network of a probability distribution $P(u_1, \ldots, u_n)$ is constructed as defined in Section 2 with an important addition. After the topology of the network is set, we also associate with each node $u_i$ a conditional probability distribution $P(u_i \mid \pi(u_i))$. By the chain rule it follows that

 P(u1,…,un)=∏P(ui∣u1,…,ui−1)

and by the definition of $\pi(u_i)$ we further obtain

 P(u1,…,un)=∏P(ui∣π(ui))

Thus, the joint distribution is represented by the network and can be used for computing the posterior probability of every variable, given a specific value for some other variables. For example, for the network of the burglary story (Figure 1), we need to specify the following conditional distributions: $P(\text{burglary})$, $P(\text{sensorA} \mid \text{burglary})$, $P(\text{sensorB} \mid \text{burglary})$, $P(\text{alarm} \mid \text{sensorA}, \text{sensorB})$, and $P(\text{patrol} \mid \text{alarm})$. From these numbers, we can now compute any probability involving these variables.
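As an illustration of the factorization, the sketch below multiplies five conditional distributions for the burglary network into a joint table and then computes the posterior probability of a burglary given that a patrol car is observed, the query posed in Section 2. The numerical values are hypothetical, since the article does not supply any, and the network structure is the one implied by the story.

```python
from itertools import product


def bern(p, value):
    """Probability that a binary variable with P(yes) = p takes the given value."""
    return p if value == 1 else 1.0 - p


# Hypothetical conditional distributions (1 = yes, 0 = no).
P_BURGLARY = 0.01
P_SENSOR_A = {1: 0.90, 0: 0.05}      # P(sensorA = yes | burglary)
P_SENSOR_B = {1: 0.80, 0: 0.10}      # P(sensorB = yes | burglary)
P_ALARM = {(1, 1): 0.99, (1, 0): 0.90, (0, 1): 0.90, (0, 0): 0.01}
P_PATROL = {1: 0.70, 0: 0.02}        # P(patrol = yes | alarm)

# Joint distribution as the product of each node's distribution given its parents.
joint = {}
for b, sa, sb, al, pt in product((0, 1), repeat=5):
    joint[(b, sa, sb, al, pt)] = (
        bern(P_BURGLARY, b)
        * bern(P_SENSOR_A[b], sa)
        * bern(P_SENSOR_B[b], sb)
        * bern(P_ALARM[(sa, sb)], al)
        * bern(P_PATROL[al], pt)
    )


def posterior_burglary_given_patrol(joint):
    """P(burglary = yes | patrol = yes), obtained by summing out the other variables."""
    patrol_yes = sum(p for (b, sa, sb, al, pt), p in joint.items() if pt == 1)
    both_yes = sum(p for (b, sa, sb, al, pt), p in joint.items() if pt == 1 and b == 1)
    return both_yes / patrol_yes


posterior = posterior_burglary_given_patrol(joint)
print(round(posterior, 4))
```

Summing over the full joint table is exponential in the number of variables; it is used here only to show that the five local distributions determine every probability of interest.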

A similarity network is a set of Bayesian networks, called the local networks, each constructed under a different set of hypotheses $H_i$. In each local network $D_i$, only those variables that “help to distinguish” between the hypotheses in $H_i$ are depicted. The success of this model stems from the fact that only a small portion of variables helps to distinguish between the carefully chosen set of hypotheses $H_i$. Thus, the model usually includes several small networks instead of one large Bayesian network.

For example, Figure 2 is an example of a similarity network representation of where is a distinguished variable that represents five hypotheses . In this similarity network, variable is the only one that helps to discriminate between and , and variable is the only variable that does not help to discriminate among .

At the heart of the definition of similarity networks lies the notion of discrimination. The study of the relations coupled, related and relevant presented in the previous sections, enables us to formulate this notion in two ways, yielding two types of similarity networks.

Definition [Geiger and Heckerman, 1993] A similarity network constructed by including in each local network $D_i$ every variable $x$, such that $x$ and $h$ are related given that $h$ draws its values from $H_i$, is of type 1. A similarity network constructed by including in each local network $D_i$ every variable $x$, such that $x$ and $h$ are relevant given that $h$ draws its values from $H_i$, is of type 2. Here $h$ denotes the distinguished hypothesis variable.

In [Geiger and Heckerman, 1993], we show that type 1 similarity networks are diagnostically complete. That is, although some variables are removed from each local network, the posterior probability of every hypothesis, given any value combination for the variables in the domain, can still be computed. This result is reassuring because it guarantees that the computation we strive to achieve, namely the computation of the posterior probability of the hypothesis, can be performed. The caveat of this result is that a knowledge engineer uses a type 1 similarity network to determine whether a variable “helps to discriminate” the values in $H_i$ by asking a domain expert whether the node corresponding to this variable is connected to $h$ in the local Bayesian network associated with $H_i$. This query might be too hard for a domain expert to answer, because a domain expert does not necessarily understand the properties of Bayesian networks.

On the other hand, a knowledge engineer uses a type 2 similarity network to determine whether a node “does not help to discriminate” the values in $H_i$ by asking an expert whether this variable can ever help to distinguish the values of $h$, given that $h$ draws its values from $H_i$. This query concerns the subject matter of the domain, and therefore a domain expert can answer it more reliably. In fact, this is the actual query Heckerman used in constructing his lymph-node pathology diagnosis system.

Next, we show that these two definitions coincide for large families of probability distributions.

## 6 Transitive Distributions

We show that the relation relevant is transitive whenever it is defined by a probability distribution that belongs to one of the following two families: strictly positive binary distributions and regular Gaussian distributions. Hence, for these two classes of distributions, type 1 and type 2 similarity networks are identical. Currently, we are working to show that transitivity holds for other families.

Definition A strictly positive binary distribution is a probability distribution where every variable has a domain of two values (say, $0$ and $1$) and every combination of the variables’ values has a probability greater than zero. A regular Gaussian distribution is a multivariate normal distribution with finite nonzero variances and with finite means.

###### Theorem 12

Let $P(U \cup \{e\})$ be a strictly positive binary distribution or a regular Gaussian distribution. Let $\{X_1, X_2\}$, $\{Y_1, Y_2\}$, and $\{Z_1, Z_2\}$ be three partitions of $U$. Let $R_1$ be $X_1 \cap Y_1 \cap Z_1$, and $R_2$ be $X_2 \cap Y_2 \cap Z_2$. Then,

 I(X1,X2∣∅) & I(Y1,Y2∣e=e′) & I(Z1,Z2∣e=e′′) ⇒ I(R1,{e}∪U∖R1∣∅) ∨ I(R2,{e}∪U∖R2∣∅) (11)

where $e'$ and $e''$ are two distinct values of $e$.

When all three partitions are identical, the above theorem can be phrased as follows. If two sets of variables $A$ and $B$ are marginally independent, and if $I(A, B \mid e)$ holds as well, then either $A$ is marginally independent of $\{e\} \cup B$ or $B$ is marginally independent of $\{e\} \cup A$. This special case has been stated in the literature [Dawid, 1979, Pearl, 1988].

The proof of Theorem 12 is given in Appendices A and B. Theorem 12 and the theorem below state together that strictly positive binary distributions and regular Gaussian distributions are transitive. Our assumptions of strict positiveness and regularity were added to obtain a simpler proof. We conjecture that both theorems still hold when these restrictions are omitted.

###### Theorem 13

Every probability distribution that satisfies Equation 11 is transitive.

Proof: Let $P$ be a probability distribution over $U$, let $n = |U|$; and let $x$, $y$ be two arbitrary variables in $U$ such that $x$ and $y$ are mutually irrelevant. We will show by induction on $n$ that if $P$ satisfies Equation 11, then $x$ and $y$ are uncoupled. Thus, according to Theorem 9, $P$ is transitive.

If $n = 2$, then the variables $x$ and $y$ are mutually irrelevant. Thus, $I(x, y \mid \emptyset)$ holds for $P$. Consequently, $x$ and $y$ are uncoupled. Otherwise, assume without loss of generality that $x$ is $u_1$ and $y$ is $u_2$, and denote $u_n$ by $e$. Since $x$ and $y$ are mutually irrelevant with respect to $P$, $x$ and $y$ are also mutually irrelevant with respect to $P(u_1, \ldots, u_{n-1})$, $P(u_1, \ldots, u_{n-1} \mid e = e')$, and $P(u_1, \ldots, u_{n-1} \mid e = e'')$, where $e'$ and $e''$ are two distinct values of $e$. Thus, by applying the induction hypothesis three times, we conclude that there are three partitions $\{X_1, X_2\}$, $\{Y_1, Y_2\}$, and $\{Z_1, Z_2\}$ of $U \setminus \{e\}$ such that $x$ is in $X_1$, $Y_1$, and $Z_1$, and $y$ is in $X_2$, $Y_2$, and $Z_2$. Hence, the antecedents of Equation 11 are satisfied. Consequently, $U$ can be partitioned into two marginally independent sets: either $R_1$ and $U \setminus R_1$, or $R_2$ and $U \setminus R_2$, where $R_1$ is $X_1 \cap Y_1 \cap Z_1$ and $R_2$ is $X_2 \cap Y_2 \cap Z_2$. Because, in both cases, one set contains $x$ and the other contains $y$, it follows that $x$ and $y$ are uncoupled.

The practical ramification of this theorem is that the choice between defining discrimination via the relation related and defining it via relevant is not critical. In many situations the two concepts coincide.

From a mathematical point of view, our proof demonstrates that by using an abstraction of conditional independence, namely the trinary relation $I$ combined with a set of axioms, we are able to prove properties of very distinct classes of distributions: strictly positive binary distributions and regular Gaussian distributions.

## 7 Summary

We have examined the notion of unrelatedness of variables in a probabilistic framework, introduced three formulations of this notion, and explored their interrelationships. From a practical point of view, these results legitimize prevailing decomposition techniques of knowledge acquisition. These results permit an expert to decompose the construction of a complex Bayesian network into a set of Bayesian networks of manageable size.

Our proofs use the qualitative notion of independence as captured by the axioms of graphoids. These proofs would have been harder to obtain had we used the usual definitions of conditional independence. This axiomatic approach enables us to identify a common property—Equation 11—shared by two distinct classes of probability distributions (regular Gaussian and strictly positive binary), and to use this property without attending to the detailed characteristics of these classes.

## Acknowledgments

An earlier version of this article was presented at the Sixth Conference on Uncertainty in Artificial Intelligence [Geiger and Heckerman, 1990].

## References

• [Dawid, 1979] Dawid, A.P. (1979). Conditional independence in statistical theory. Journal of the Royal Statistical Society B 41, 1-31.
• [Geiger, 1990] Geiger, D. 1990. Graphoids: A Qualitative Framework for Probabilistic Inference. Ph.D. Diss., Computer Science Dept., UCLA.
• [Geiger and Heckerman, 1993] Geiger, D., and Heckerman, D. 1993. Knowledge representation and inference in similarity networks and Bayesian multinets. Submitted.
• [Geiger and Heckerman, 1990] Geiger, D., and Heckerman, D. 1990. Separable and transitive graphoids. Proc. Sixth Conference on Uncertainty in Artificial Intelligence.
• [Geiger et al., 1990] Geiger, D.; Verma, T.S.; and Pearl, J. 1990. Identifying Independence in Bayesian Networks. Networks, 20:507-534.
• [Geiger and Pearl, 1990] Geiger, D., and Pearl, J. 1990. On the logic of causal models. In Uncertainty in Artificial Intelligence 4, eds. Shachter, R.D.; Levitt, T.S.; Kanal, L.N.; and Lemmer, J.F., 3–12. Elsevier Science Publishers B.V. (North-Holland).
• [Heckerman, 1990] Heckerman, D. 1990. Probabilistic Similarity Networks. MIT Press.
• [Lauritzen et al., 1992] Lauritzen, S.L.; Dawid, A.P.; Larsen, B.N.; and Leimer, H.G. Independence properties of directed Markov fields. Networks, 20:491-506.
• [Pearl, 1988] Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA: Morgan Kaufmann.
• [Pearl and Paz, 1989] Pearl, J. and Paz, A. 1989. Graphoids: A Graph-Based Logic for Reasoning about Relevance Relations. In Advances in Artificial Intelligence-II, eds. Du Boulay, B. et al., 357–363. Amsterdam: North Holland.
• [Shachter, 1992] Shachter, R.D. An ordered examination of influence diagrams. Networks, 20:535-564.
• [Spohn, 1980] Spohn, W. 1980. Stochastic independence, causal independence, and shieldability. Journal of Philosophical Logic 9, 73-99.
• [Verma and Pearl, 1988] Verma, T. and Pearl, J. (1988). Causal networks: Semantics and expressiveness. In Proceedings of Fourth Workshop on Uncertainty in Artificial Intelligence, Minneapolis, MN, pages 352–359. Association for Uncertainty in Artificial Intelligence, Mountain View, CA. Also, Technical Report R-65 by Verma, Cognitive Systems Laboratory, University of California, Los Angeles, 1986.

## Appendix A: Strictly Positive Binary Distributions

Below, we prove Theorem 12 for strictly positive binary distributions. First, we phrase the theorem differently.

###### Theorem 14

Strictly positive binary distributions satisfy the following axiom. (In complicated expressions, is used as a shorthand notation for and denotes .)

$$I(A_1,\, e A_2 A_3 A_4 B_1 B_2 B_3 B_4 \mid \emptyset) \;\lor\; I(B_1,\, e A_1 A_2 A_3 A_4 B_2 B_3 B_4 \mid \emptyset) \tag{12}$$

where all sets mentioned are pairwise disjoint and do not contain , and and are distinct values of .

To obtain the original theorem, we set , , , , , and to be equal to of the original theorem, respectively.

Denote the three antecedents of Equation 12 by , , and . We need the following two Lemmas.

###### Lemma 15

Let $X$ and $Y$ be two disjoint sets of variables, and let $e$ be an instance of a single binary variable not in $X \cup Y$. Let $P$ be a probability distribution over the variables $X \cup Y \cup \{e\}$. If $I(X, Y \mid e)$ holds for $P$, then for every pair of instances $X', X''$ of $X$ and $Y', Y''$ of $Y$, the following equation must hold:

$$\frac{P(e \mid X'Y')\,P(X'Y')}{P(e \mid X''Y')\,P(X''Y')} = \frac{P(e \mid X'Y'')\,P(X'Y'')}{P(e \mid X''Y'')\,P(X''Y'')}$$

Proof: Bayes' theorem states that

$$P(X' \mid eY') = \frac{P(e \mid X'Y')\,P(X'Y')}{P(eY')}$$

Thus,

$$\frac{P(e \mid X'Y')\,P(X'Y')}{P(e \mid X''Y')\,P(X''Y')} = \frac{P(X' \mid e, Y')}{P(X'' \mid e, Y')} = \frac{P(X' \mid e, Y'')}{P(X'' \mid e, Y'')} = \frac{P(e \mid X'Y'')\,P(X'Y'')}{P(e \mid X''Y'')\,P(X''Y'')}$$

The middle equality follows from the fact that $I(X, Y \mid e)$ holds for $P$: given $e$, the distribution of $X$ does not depend on the instance of $Y$.
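The identity of Lemma 15 can be checked numerically. The sketch below is an illustration under stated assumptions: $X$ and $Y$ are taken to be single three-valued variables, the instance $e$ is encoded as the value 1 of a binary variable, and the joint distribution is constructed so that it factorises on the slice where that variable equals 1 (so that $X$ and $Y$ are independent given $e$). The other slice is arbitrary. All names in the code are ours.

```python
import itertools
import random

random.seed(0)
X_VALUES, Y_VALUES = range(3), range(3)

def normalise(weights):
    total = sum(weights)
    return [w / total for w in weights]

# P(x, y, e=1) factorises, so X and Y are independent given e=1.
# The e=0 slice is an arbitrary strictly positive table.
p_e1 = 0.4
p_x = normalise([random.uniform(0.1, 1.0) for _ in X_VALUES])
p_y = normalise([random.uniform(0.1, 1.0) for _ in Y_VALUES])
slice0 = normalise([random.uniform(0.1, 1.0) for _ in range(9)])

joint = {}
for i, (x, y) in enumerate(itertools.product(X_VALUES, Y_VALUES)):
    joint[(x, y, 1)] = p_e1 * p_x[x] * p_y[y]
    joint[(x, y, 0)] = (1 - p_e1) * slice0[i]

def p_xy(x, y):
    """Marginal P(x, y)."""
    return joint[(x, y, 0)] + joint[(x, y, 1)]

def p_e_given_xy(e, x, y):
    """Conditional P(e | x, y)."""
    return joint[(x, y, e)] / p_xy(x, y)

def ratio(e, x1, x2, y):
    """P(e | x1 y) P(x1 y) / (P(e | x2 y) P(x2 y)): one side of the lemma's equation."""
    return (p_e_given_xy(e, x1, y) * p_xy(x1, y)) / (p_e_given_xy(e, x2, y) * p_xy(x2, y))

# Largest difference between the two sides over all pairs of instances.
max_gap = max(
    abs(ratio(1, x1, x2, y1) - ratio(1, x1, x2, y2))
    for x1, x2 in itertools.product(X_VALUES, repeat=2)
    for y1, y2 in itertools.product(Y_VALUES, repeat=2)
)
print(max_gap)  # close to zero, up to floating-point rounding
```

Each side of the equation reduces to the ratio of the two conditional probabilities of $X$ given $e$, which is why the choice of the instance of $Y$ does not matter.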

###### Lemma 16

Let $A_1$, $A_2$, $A_3$, $A_4$, $B_1$, $B_2$, $B_3$, and $B_4$ be disjoint sets of variables, and let $e$ be a single binary variable not contained in any of these sets. Let $P$ be a probability distribution over the union of these variables. If the three antecedents of Equation 12 hold for $P$, then the following conditions must also hold:

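$$I(A_1, e \mid A_2' A_3' A_4' B_1' B_2' B_3' B_4') \;\Rightarrow\; I(A_1, e \mid A_2' A_3' A_4' B_1 B_2 B_3' B_4') \tag{13}$$

$$I(B_1, e \mid A_1' A_2' A_3' A_4' B_2' B_3' B_4') \;\Rightarrow\; I(B_1, e \mid A_1 A_2 A_3' A_4' B_2' B_3' B_4') \tag{14}$$

$$I(A_1, e \mid A_2' A_3' A_4' B_1' B_2' B_3' B_4') \;\Rightarrow\; I(A_1, e \mid A_2' A_3' A_4' B_1 B_2' B_3 B_4') \tag{15}$$

$$I(B_1, e \mid A_1' A_2' A_3' A_4' B_2' B_3' B_4') \;\Rightarrow\; I(B_1, e \mid A_1 A_2' A_3 A_4' B_2' B_3' B_4') \tag{16}$$

$$I(A_1, e \mid A_2' A_3' A_4' B_1' B_2' B_3' B_4') \;\Rightarrow\; I(A_1, e \mid A_2' A_3' A_4 B_1 B_2' B_3' B_4') \tag{17}$$

 I(B1,e∣A′1A′2A′3A′4B′2B′3B′4) ⇒ I(B1,e∣A1A′2A′3A′4B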