# Approximate Implication with d-Separation

The graphical structure of Probabilistic Graphical Models (PGMs) encodes the conditional independence (CI) relations that hold in the modeled distribution. Graph algorithms, such as d-separation, use this structure to infer additional conditional independencies, and to query whether a specific CI holds in the distribution. The premise of all current systems of inference for deriving CIs in PGMs is that the set of CIs used for the construction of the PGM holds exactly. In practice, algorithms for extracting the structure of PGMs from data discover approximate CIs that do not hold exactly in the distribution. In this paper, we ask how the error in this set propagates to the inferred CIs read off the graphical structure. More precisely, what guarantee can we provide on the inferred CI when the set of CIs that entailed it holds only approximately? It has recently been shown that in the general case, no such guarantee can be provided. We prove that such a guarantee exists for the set of CIs inferred in directed graphical models, making the d-separation algorithm a sound and complete system for inferring approximate CIs. We also prove an approximation guarantee for independence relations derived from marginal CIs.


## 1 Introduction

Conditional independencies (CI) are assertions of the form (X ⊥ Y | Z), stating that the random variables (RVs) X and Y are independent when conditioned on Z. The concept of conditional independence is at the core of Probabilistic Graphical Models (PGMs), which include Bayesian and Markov networks. The CI relations between the random variables enable the modular and low-dimensional representation of high-dimensional, multivariate distributions, and tame the complexity of inference and learning, which would otherwise be very inefficient [DBLP:books/daglib/0023091, DBLP:books/daglib/0066829].

The implication problem is the task of determining whether a set of CIs, termed antecedents, logically entails another CI, called the consequent; it has received considerable attention from both the AI and Database communities [DBLP:conf/ecai/PearlP86, DBLP:conf/uai/GeigerVP89, DBLP:journals/iandc/GeigerPP91, SAYRAFI2008221, DBLP:conf/icdt/KenigS20, DBLP:conf/sigmod/KenigMPSS20]. Known algorithms for deriving CIs from the topological structure of the graphical model are, in fact, an instance of implication. Notably, the DAG structure of Bayesian Networks is generated based on a set of CIs termed the recursive basis [DBLP:journals/networks/GeigerVP90], and the d-separation algorithm is used to derive additional CIs implied by this set. The d-separation algorithm is a sound and complete method for deriving CIs in probability distributions represented by DAGs [DBLP:conf/uai/GeigerVP89, DBLP:journals/networks/GeigerVP90], and hence completely characterizes the CIs that hold in the distribution. The foundation of deriving CIs in both directed and undirected models is the semigraphoid axioms [Dawid1979, GEIGER1991128, GeigerPearl1993].

Current systems for inferring CIs, and the semigraphoid axioms in particular, assume that both antecedents and consequent hold exactly, hence we refer to these as an exact implication (EI). However, almost all known approaches for learning the structure of a PGM rely on CIs extracted from data, which hold to a large degree, but cannot be expected to hold exactly. Of these, structure-learning approaches based on information theory have been shown to be particularly successful, and thus widely used to infer networks in many fields [CHENG200243, JMLR:v7:decampos06a, Chen2008TKDE, Zhao5130, DBLP:conf/sigmod/KenigMPSS20].

In this paper, we drop the assumption that the CIs hold exactly, and consider the relaxation problem: if an exact implication holds, does an approximate implication hold too? That is, if the antecedents approximately hold in the distribution, does the consequent approximately hold as well? What guarantees can we give for the approximation? In other words, the relaxation problem asks whether we can convert an exact implication into an approximate one. When relaxation holds, any system of inference for deriving exact implications (e.g., the semigraphoid axioms, d-separation) can be used to infer approximate implications as well.

To study the relaxation problem we need to measure the degree of satisfaction of a CI. In line with previous work, we use Information Theory. This is the natural semantics for modeling CIs because (X ⊥ Y | Z) holds if and only if I(X;Y|Z) = 0, where I(X;Y|Z) is the conditional mutual information. Hence, an exact implication (EI) is an assertion of the form (h(Σ) = 0) ⇒ (h(τ) = 0), where Σ and τ are triples (A;B|C), and h(A;B|C) is the conditional mutual information measure I(A;B|C). An approximate implication (AI) is a linear inequality λ·h(Σ) ≥ h(τ), where λ ≥ 0 is the approximation factor. We say that a class of CIs λ-relaxes if every exact implication (EI) from the class can be transformed into an approximate implication (AI) with an approximation factor λ. We observe that an approximate implication always implies an exact implication because the mutual information is a nonnegative measure. Therefore, if λ·h(Σ) ≥ h(τ) for some λ ≥ 0, then h(Σ) = 0 implies h(τ) = 0.
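To make the information-theoretic semantics concrete, the following sketch computes the conditional mutual information directly from a joint probability mass function via the identity I(X;Y|Z) = H(XZ) + H(YZ) − H(Z) − H(XYZ). The dictionary-based pmf and the index-set encoding are our own illustrative choices, not notation from the paper:

```python
import math

def cmi(pmf, X, Y, Z):
    """Conditional mutual information I(X;Y|Z) in bits.
    `pmf` maps full outcome tuples to probabilities; X, Y, Z are
    sets of coordinate indices into those tuples."""
    def H(idx):
        # entropy of the marginal distribution over the coordinates in `idx`
        marg = {}
        for outcome, p in pmf.items():
            key = tuple(outcome[i] for i in sorted(idx))
            marg[key] = marg.get(key, 0.0) + p
        return -sum(p * math.log2(p) for p in marg.values() if p > 0)
    X, Y, Z = set(X), set(Y), set(Z)
    # I(X;Y|Z) = H(XZ) + H(YZ) - H(Z) - H(XYZ)
    return H(X | Z) + H(Y | Z) - H(Z) - H(X | Y | Z)

# Two fair bits and their XOR: marginally independent,
# but fully dependent given the third variable.
xor_pmf = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}
```

Here `cmi(xor_pmf, {0}, {1}, set())` evaluates to 0 while `cmi(xor_pmf, {0}, {1}, {2})` evaluates to 1 bit, matching the fact that an exact CI corresponds to a vanishing conditional mutual information.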

Results. A conditional independence assertion (A;B|C) is called saturated if it mentions all of the random variables in the distribution, and it is called marginal if C = ∅.

We show that every conditional independence relation read off a DAG by the d-separation algorithm [DBLP:conf/uai/GeigerVP89] admits a 1-approximation. In other words, if Σ is the recursive basis of CIs used to build the Bayesian network [DBLP:conf/uai/GeigerVP89], then it is guaranteed that h(Σ) ≥ h(τ). Furthermore, we present a family of implications for which our 1-approximation is tight. We also prove that every CI τ = (A;B|C) implied by a set of marginal CIs admits an (|A|·|B|)-approximation (where |A| denotes the number of RVs in the set A). The exact variants of implication from these classes of CIs were extensively studied [DBLP:conf/uai/GeigerVP89, GeigerPearl1993, DBLP:conf/uai/GeigerP88, DBLP:journals/iandc/GeigerPP91, DBLP:journals/networks/GeigerVP90] (see the related work below). Here, we study their approximation.

Of independent interest is the technique used for proving the approximation guarantees. The I-measure [DBLP:journals/tit/Yeung91] is a theory which establishes a one-to-one correspondence between information-theoretic measures, such as entropy and mutual information (defined in Section 2), and set theory. Ours is the first work to apply this technique to the study of CI implication.

Related Work. The AI community has extensively studied the exact implication problem for Conditional Independencies (CI). In a series of papers, Geiger et al. showed that the semigraphoid axioms [DBLP:conf/ecai/PearlP86] are sound and complete for deriving CI statements that are implied by saturated CIs [GeigerPearl1993], marginal CIs [GeigerPearl1993], and recursive CIs that are used in Bayesian networks [DBLP:journals/networks/GeigerVP90, DBLP:conf/uai/GeigerP88]. The completeness of d-separation follows from the fact that the set of CIs derived by d-separation is precisely the closure of the recursive basis under the semigraphoid axioms [VERMA199069]. Studený proved that in the general case, when no assumptions are made on the antecedents, no finite axiomatization exists [StudenyCINoCharacterization1990]. That is, there does not exist a finite set of axioms (deductive rules) from which all general conditional independence implications can be deduced.

The database community has also studied the EI problem for integrity constraints [DBLP:journals/tods/ArmstrongD80, DBLP:conf/sigmod/BeeriFH77, 10.1007/978-3-642-39992-3_17, Maier:1983:TRD:1097039], and showed that the implication problem is decidable and axiomatizable when the antecedents are Functional Dependencies or Multivalued Dependencies (which correspond to saturated CIs, see [DBLP:journals/tse/Lee87, DBLP:conf/icdt/KenigS20]), and undecidable for Embedded Multivalued Dependencies [10.1006/inco.1995.1148].

The relaxation problem was first studied by Kenig and Suciu in the context of database dependencies [DBLP:conf/icdt/KenigS20], where they showed that CIs derived from a set of saturated antecedents admit an approximate implication. Importantly, they also showed that not all exact implications relax, and presented a family of 4-variable distributions along with an exact implication that does not admit an approximation (see Theorem 16 in [DBLP:conf/icdt/KenigS20]). Consequently, an exact implication does not necessarily imply its approximate counterpart, and arriving at meaningful approximation guarantees requires making certain assumptions on the antecedents, the consequent, or both.

Organization. We start in Section 2 with preliminaries. We formally define the relaxation problem in Section 3, and formally state our results in Section 4. In Section 5 we establish, through a series of lemmas, properties of exact implication that will be used for proving our results. In Section 6 we prove that every implication from a set of recursive CIs admits a 1-relaxation, and in Section 7 we prove that every implication from a set of marginal CIs admits an (|A|·|B|)-relaxation. We conclude in Section 8.

## 2 Preliminaries

We denote by [n] the set {1,…,n}. If Ω = {X_1,…,X_n} denotes a set of variables and A, B ⊆ Ω, then we abbreviate the union A ∪ B with AB.

### 2.1 Conditional Independence

Recall that two discrete random variables X, Y are called independent if P(X = x, Y = y) = P(X = x)·P(Y = y) for all outcomes x, y. Fix Ω = {X_1,…,X_n}, a set of jointly distributed discrete random variables with finite domains D_1,…,D_n, respectively; let p be the probability mass. For A ⊆ Ω, denote by A the joint random variable with domain ∏_{X_i∈A} D_i. We write (A ⊥ B | C) when A, B are conditionally independent given C; in the special case that C functionally determines B, we write C → B.

An assertion (A ⊥ B | C) is called a Conditional Independence statement, or a CI; this includes C → B as a special case. When ABC = Ω we call it saturated, and when C = ∅ we call it marginal. A set of CIs Σ implies a CI τ, in notation Σ ⇒ τ, if every probability distribution that satisfies Σ also satisfies τ.

### 2.2 Background on Information Theory

We adopt required notation from the literature on information theory [Yeung:2008:ITN:1457455]. For a set Ω of n variables, we identify the functions h : 2^Ω → ℝ with the vectors in ℝ^{2^n}.

Polymatroids. A function h : 2^Ω → ℝ is called a polymatroid if h(∅) = 0 and h satisfies the following inequalities, called Shannon inequalities:

1. Monotonicity: h(A) ≤ h(B) for A ⊆ B

2. Submodularity: h(A ∪ B) + h(A ∩ B) ≤ h(A) + h(B) for all A, B ⊆ Ω

The set of polymatroids is denoted Γ_n. For any polymatroid h and subsets A, B, C ⊆ Ω, we define (recall that AB denotes A ∪ B):

 h(B|A) def= h(AB) − h(A)   (1)
 I_h(B;C|A) def= h(AB) + h(AC) − h(ABC) − h(A)   (2)

Then, I_h(B;C|A) ≥ 0 by submodularity, and h(B|A) ≥ 0 by monotonicity. We say that A functionally determines B, in notation A → B, if h(B|A) = 0. The chain rule is the identity:

 Ih(B;CD|A)=Ih(B;C|A)+Ih(B;D|AC) (3)

We call the triple (B;C|A) elemental if |B| = |C| = 1; h(B|A) is a special case of a mutual information, because h(B|A) = I_h(B;B|A). By the chain rule, it follows that every CI (B;C|A) can be written as a sum of at most |B|·|C| elemental CIs.
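The chain-rule decomposition into elemental terms can be sketched as a short recursion; the list encoding of triples and the variable names below are our own, for illustration only:

```python
def elemental_terms(A, B, C):
    """Split I(A;B|C) into elemental terms I(a;b|Z) with single variables
    a and b, by repeated application of the chain rule (3)."""
    A, B, C = list(A), list(B), list(C)
    if len(B) > 1:
        # I(A; b B' | C) = I(A; b | C) + I(A; B' | C b)
        b, rest = B[0], B[1:]
        return elemental_terms(A, [b], C) + elemental_terms(A, rest, C + [b])
    if len(A) > 1:
        # symmetric split on the first argument
        a, rest = A[0], A[1:]
        return elemental_terms([a], B, C) + elemental_terms(rest, B, C + [a])
    return [(A[0], B[0], frozenset(C))]

# I(a1 a2 ; b1 b2 | c) decomposes into |A| * |B| = 4 elemental terms.
terms = elemental_terms(["a1", "a2"], ["b1", "b2"], ["c"])
```

The recursion produces exactly |A|·|B| elemental terms, which is the source of the "at most |B|·|C| elemental CIs" bound above.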

Entropy. If X is a random variable with a finite domain D and probability mass p, then H(X) denotes its entropy:

 H(X) def= ∑_{x∈D} p(x) log(1/p(x))   (4)

For a set of jointly distributed random variables Ω = {X_1,…,X_n} we define the function h : 2^Ω → ℝ as h(A) def= H(A); h is called an entropic function, or, with some abuse, an entropy. It is easily verified that the entropy satisfies the Shannon inequalities, and is thus a polymatroid. The quantities h(B|A) and I_h(B;C|A) are called the conditional entropy and conditional mutual information, respectively. The conditional independence (B ⊥ C | A) holds iff I_h(B;C|A) = 0, and similarly A → B iff h(B|A) = 0; thus, entropy provides us with an alternative characterization of CIs.
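As a sanity check that entropic functions are polymatroids, the sketch below builds the entropy vector of a random 3-variable joint distribution and verifies monotonicity and submodularity on every pair of subsets. The construction and names are ours, for illustration under the definitions above:

```python
import itertools, math, random

def entropy_vector(pmf, n):
    """Map each subset A of {0,...,n-1} to the entropy of its marginal."""
    h = {}
    for r in range(n + 1):
        for A in itertools.combinations(range(n), r):
            marg = {}
            for outcome, p in pmf.items():
                key = tuple(outcome[i] for i in A)
                marg[key] = marg.get(key, 0.0) + p
            h[frozenset(A)] = -sum(p * math.log2(p)
                                   for p in marg.values() if p > 0)
    return h

def is_polymatroid(h, eps=1e-9):
    """Check h(emptyset) = 0, monotonicity, and submodularity."""
    subsets = list(h)
    ok = h[frozenset()] == 0
    for A, B in itertools.product(subsets, repeat=2):
        if A <= B:  # monotonicity: h(A) <= h(B)
            ok = ok and h[A] <= h[B] + eps
        # submodularity: h(A u B) + h(A n B) <= h(A) + h(B)
        ok = ok and h[A | B] + h[A & B] <= h[A] + h[B] + eps
    return ok

random.seed(0)
weights = [random.random() for _ in range(8)]
total = sum(weights)
rand_pmf = {o: w / total for o, w in
            zip(itertools.product((0, 1), repeat=3), weights)}
```

Running `is_polymatroid(entropy_vector(rand_pmf, 3))` returns `True` for any joint distribution, since entropy always satisfies the Shannon inequalities.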

#### 2.2.1 The I-measure

The I-measure [DBLP:journals/tit/Yeung91, Yeung:2008:ITN:1457455] is a theory which establishes a one-to-one correspondence between Shannon’s information measures and set theory. Let h denote a polymatroid defined over the variables Ω = {X_1,…,X_n}. Every variable X_i is associated with a set X̃_i, and its complement X̃_i^c. The universal set is Λ def= X̃_1 ∪ ⋯ ∪ X̃_n. For a set of variables A ⊆ Ω, we denote by X̃_A the union ⋃_{X_i∈A} X̃_i.

###### Definition 2.1.

([DBLP:journals/tit/Yeung91, Yeung:2008:ITN:1457455])   The field F_n generated by the sets X̃_1,…,X̃_n is the collection of sets which can be obtained by any sequence of the usual set operations (union, intersection, complement, and difference) on X̃_1,…,X̃_n.

The atoms of F_n are sets of the form Y_1 ∩ ⋯ ∩ Y_n, where each Y_i is either X̃_i or X̃_i^c. We consider only atoms in which at least one set appears in positive form (i.e., the atom X̃_1^c ∩ ⋯ ∩ X̃_n^c is empty). There are 2^n − 1 non-empty atoms, and every set in F_n can be expressed as the union of its atoms. A function μ is set additive if for every pair of disjoint sets A and B it holds that μ(A ∪ B) = μ(A) + μ(B). A real function μ defined on F_n is called a signed measure if it is set additive, and μ(∅) = 0.

The I-measure μ* on F_n is defined by μ*(X̃_A) def= H(A) for all nonempty subsets A ⊆ Ω, where H is the entropy (4). Table 1 summarizes the extension of this definition to the rest of the Shannon measures.

Yeung’s I-measure Theorem establishes the one-to-one correspondence between Shannon’s information measures and μ*.

###### Theorem 2.2.

([DBLP:journals/tit/Yeung91, Yeung:2008:ITN:1457455])   [I-Measure Theorem] μ* is the unique signed measure on F_n which is consistent with all of Shannon’s information measures (i.e., entropies, conditional entropies, and mutual information).

Let σ = (A;B|C). We denote by m(σ) the set associated with I(A;B|C) (see Table 1). For a set of triples Σ, we define:

 m(Σ) def= ⋃_{σ∈Σ} m(σ)   (5)
###### Example 2.3.

Let , , and be three disjoint sets of RVs defined as follows: , and . Then, by Theorem 2.2: , , and . By Table 1: .

We call the signed measures that assign non-negative values to the atoms of F_n positive I-measures.

###### Theorem 2.4.

([Yeung:2008:ITN:1457455])   If there is no constraint on X_1,…,X_n, then μ* can take any set of nonnegative values on the nonempty atoms of F_n.

Theorem 2.4 implies that every positive I-measure corresponds to a function that is consistent with the Shannon inequalities, and is thus a polymatroid. Hence, Δ_n is the set of polymatroids with a positive I-measure, which we call positive polymatroids.

### 2.3 Bayesian Networks

A Bayesian network encodes the CIs of a probability distribution using a Directed Acyclic Graph (DAG). Each node in a Bayesian network corresponds to a variable X_i, a set of nodes corresponds to a set of variables, and x_i is a value from the domain of X_i. Each node X_i in the network represents the distribution p(x_i | x_{π(i)}), where π(X_i) is the set of variables that correspond to the parent nodes of X_i. The distribution represented by a Bayesian network is:

 p(x_1,…,x_n) = ∏_{i=1}^{n} p(x_i | x_{π(i)})   (6)

(when X_i has no parents, then p(x_i | x_{π(i)}) def= p(x_i)).

Equation 6 implicitly encodes a set of conditional independence statements, called the recursive basis for the network:

 Σ\tiny def={(Xi;X1…Xi−1∖π(Xi)∣π(Xi)):i∈[n]} (7)

The implication problem associated with Bayesian Networks is to determine whether Σ ⇒ τ for a CI τ. Geiger and Pearl have shown that Σ ⇒ τ iff τ can be derived from Σ using the semigraphoid axioms [DBLP:journals/networks/GeigerVP90]. Their result establishes that the semigraphoid axioms are sound and complete for inferring CI statements from the recursive basis.
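For illustration, the recursive basis (7) can be enumerated mechanically from a parent map. The encoding below is our own sketch, assuming the indexing 1,…,n is a topological order of the DAG:

```python
def recursive_basis(parents):
    """Return triples (X_i ; {X_1,...,X_{i-1}} minus pi(X_i) | pi(X_i)),
    one per variable, as in Eq. (7). `parents` maps each variable
    index i in 1..n to its set of parent indices."""
    basis = []
    for i in sorted(parents):
        pi = frozenset(parents[i])
        predecessors = set(range(1, i))
        assert pi <= predecessors, "indexing must be topological"
        basis.append((i, frozenset(predecessors - pi), pi))
    return basis

# Chain DAG X1 -> X2 -> X3: the basis contains (X3 ; X1 | X2),
# and d-separation derives the further CIs implied by the basis.
chain = recursive_basis({1: set(), 2: {1}, 3: {2}})
```

Each triple asserts that X_i is independent of its non-parent predecessors given its parents, which is exactly the set of CIs encoded by the factorization (6).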

## 3 The Relaxation Problem

We now formally define the relaxation problem. We fix a set of variables Ω = {X_1,…,X_n}, and consider triples of the form (A;B|C), where A, B, C ⊆ Ω, which we call a conditional independence, or CI. An implication is a formula Σ ⇒ τ, where Σ is a set of CIs called the antecedents and τ is a CI called the consequent. For a CI σ = (A;B|C), we define h(σ) def= I_h(A;B|C); for a set of CIs Σ, we define h(Σ) def= ∑_{σ∈Σ} h(σ). Fix a set K s.t. K ⊆ Γ_n.

###### Definition 3.1.

The exact implication (EI) Σ ⇒ τ holds in K, denoted K ⊨_EI Σ ⇒ τ, if for all h ∈ K, h(Σ) = 0 implies h(τ) = 0. The λ-approximate implication (λ-AI) holds in K, in notation K ⊨_AI λ·h(Σ) ≥ h(τ), if λ·h(Σ) ≥ h(τ) for all h ∈ K. The approximate implication holds, in notation K ⊨_AI Σ ⇒ τ, if there exists a finite λ ≥ 0 such that the λ-AI holds.

Notice that both exact (EI) and approximate (AI) implications are preserved under subsets of Γ_n: if K ⊨ Σ ⇒ τ and K′ ⊆ K, then K′ ⊨ Σ ⇒ τ, for both ⊨_EI and ⊨_AI.

Approximate implication always implies its exact counterpart. Indeed, if λ·h(Σ) ≥ h(τ) and h(Σ) = 0, then h(τ) ≤ 0, which further implies that h(τ) = 0, because I_h(A;B|C) ≥ 0 for every triple (A;B|C) and every polymatroid h. In this paper we study the reverse direction.

###### Definition 3.2.

Let 𝒞 be a syntactically-defined class of implication statements Σ ⇒ τ, and let λ ≥ 0. We say that 𝒞 admits a λ-relaxation in K, if every exact implication statement in 𝒞 has a λ-approximation:

 K ⊨_EI Σ ⇒ τ  iff  K ⊨_AI λ·h(Σ) ≥ h(τ).

In this paper, we focus on λ-relaxation in the set of polymatroids Γ_n, and on two syntactically-defined classes: 1) where Σ is the recursive basis of a Bayesian network (see (7)), and 2) where Σ is a set of marginal CIs.

###### Example 3.3.

Let , and . Since , and since by the chain rule (3), the exact implication Σ ⇒ τ admits an AI with λ = 1 (i.e., a 1-relaxation).

## 4 Formal Statement of Results

We generalize the results of Geiger et al. [DBLP:conf/uai/GeigerVP89, GEIGER1991128] by proving that implicates of the recursive set [DBLP:conf/uai/GeigerVP89], and of marginal CIs [GEIGER1991128], admit a 1-approximation and an (|A|·|B|)-approximation, respectively, and thus continue to hold also approximately.

### 4.1 Implication From Recursive CIs

Geiger et al. [DBLP:conf/uai/GeigerVP89] prove that the semigraphoid axioms are sound and complete for the implication from the recursive set (see (7)). They further showed that the set of implicates can be read off the appropriate DAG via the d-separation procedure. We show that every such exact implication can be relaxed, admitting a 1-relaxation, guaranteeing a bounded approximation for the implicates (CI relations) read off the DAG by d-separation.

We recall the definition of the recursive basis from (7):

 Σ def= {(X_i ; R_i | B_i) : i ∈ [1,n], R_iB_i = U(i)}   (8)

where U(i) def= {X_1,…,X_{i−1}} and B_i def= π(X_i). We observe that there is a single triple in Σ that mentions X_n, and that this triple is saturated.

We recall that Δ_n is the set of polymatroids whose I-measure assigns non-negative values to the atoms of F_n (see Section 2.2.1).

###### Theorem 4.1.

Let Σ be a recursive set of CIs (see (8)), and let τ be a CI. Then the following holds:

 Δ_n ⊨_EI Σ ⇒ τ  iff  Γ_n ⊨ h(Σ) ≥ h(τ)   (9)

We note that the only-if direction of Theorem 4.1 is immediate, and follows from the non-negativity of Shannon’s information measures. We prove the other direction in Section 6. Theorem 4.1 states that it is enough that the exact implication holds on all of the positive polymatroids Δ_n, because this implies the (even stronger!) statement Γ_n ⊨ h(Σ) ≥ h(τ).

### 4.2 Implication from Marginal CIs

We show that any implicate from a set of marginal CIs has an (|A|·|B|)-approximation. This generalizes the result of Geiger, Paz, and Pearl [GEIGER1991128], who proved that the semigraphoid axioms are sound and complete for deriving marginal CIs.

###### Theorem 4.2.

Let Σ be a set of marginal CIs, and let τ = (A;B|C) be any CI. Then:

 Γ_n ⊨_EI Σ ⇒ τ  iff  Γ_n ⊨ (|A|·|B|)·h(Σ) ≥ h(τ)   (10)

Also here, the only-if direction of Theorem 4.2 is immediate, and we prove the other direction in Section 7.

## 5 Properties of Exact Implication

In this section, we use the I-measure to characterize some general properties of exact implication in the set of positive polymatroids (Section 5.1), and the entire set of polymatroids (Section 5.2). The lemmas in this section will be used for proving the approximate implication guarantees presented in Section 4.

In what follows, Ω = {X_1,…,X_n} is a set of RVs, Σ denotes a set of triples representing mutual information terms, and τ denotes a single triple. We denote by var(σ) the set of RVs mentioned in σ (e.g., if σ = (A;B|C) then var(σ) = A ∪ B ∪ C).

### 5.1 Exact implication in the set of positive polymatroids

###### Lemma 5.1.

The following holds:

 Δ_n ⊨_EI Σ ⇒ τ  iff  m(Σ) ⊇ m(τ)
###### Proof.

Suppose that m(Σ) ⊉ m(τ), and let a be an atom in m(τ) ∖ m(Σ). By Theorem 2.4 there exists a positive polymatroid h in Δ_n with an I-measure μ that takes the following non-negative values on its atoms: μ(a) = 1, and μ(b) = 0 for any atom b ≠ a. Since a ∉ m(Σ), then h(Σ) = 0 while h(τ) = μ(m(τ)) = 1. Hence, Δ_n ⊭_EI Σ ⇒ τ.

Now, suppose that m(Σ) ⊇ m(τ). By Theorem 2.2, the I-measure μ is the unique signed measure on F_n that is consistent with all of Shannon’s information measures. Therefore, if h ∈ Δ_n satisfies h(Σ) = 0, then μ vanishes on every atom of m(Σ), and hence on every atom of m(τ), so h(τ) = μ(m(τ)) = 0. The result follows from the non-negativity of the Shannon information measures. ∎

An immediate consequence of Lemma 5.1 is that m(Σ) ⊇ m(τ) is a necessary condition for implication between polymatroids.

If Γ_n ⊨_EI Σ ⇒ τ then m(Σ) ⊇ m(τ).

###### Proof.

If Γ_n ⊨_EI Σ ⇒ τ then the implication must hold for any subset of polymatroids, and in particular, Δ_n ⊨_EI Σ ⇒ τ. The result follows from Lemma 5.1. ∎

###### Lemma 5.3.

Let , and let such that . Then .

###### Proof.

Let , and suppose that . By Lemma 5.1, we have that . In other words, there is an atom such that . In particular, . Hence, , and by Lemma 5.1 we get that . ∎

### 5.2 Exact Implication in the set of polymatroids

The main technical result of this section is Lemma 5.6. We start with two short technical lemmas.

###### Lemma 5.4.

Let and be CIs such that , , and . Then, .

###### Proof.

Since , we denote by , , and . Also, denote by , . So, we have that: . By the chain rule, we have that:

 I(Z_A A′X ; Z_B B′Y | C) = I(Z_A ; Z_B | C) + I(A′X ; Z_B | C Z_A)
   + I(Z_A ; B′Y | Z_B C) + I(X ; Y | C Z_A Z_B)
   + I(X ; B′ | C Z_A Z_B Y) + I(A′ ; B′Y | C Z_A Z_B X)

Noting that , we get that as required. ∎

###### Lemma 5.5.

Let Σ be a set of triples such that X_n ∉ var(σ) for all σ ∈ Σ. Likewise, let τ be a triple such that X_n ∉ var(τ). Then:

 Γ_n ⊨_EI Σ ⇒ τ  iff  Γ_{n−1} ⊨_EI Σ ⇒ τ   (11)
###### Proof.

Suppose that Γ_n ⊭_EI Σ ⇒ τ. Then there exists a polymatroid f ∈ Γ_n (Section 2.2) such that f(σ) = 0 for all σ ∈ Σ, and f(τ) > 0. We define g as follows:

 g(A) = f(A) for all A ⊆ {X_1,…,X_{n−1}}   (12)

Since f is a polymatroid, then so is g. Further, since Σ ∪ {τ} does not mention X_n then, by (12), we have that g(σ) = f(σ) for all σ ∈ Σ ∪ {τ}. Hence, Γ_{n−1} ⊭_EI Σ ⇒ τ.

Now suppose that Γ_{n−1} ⊭_EI Σ ⇒ τ. Then there exists a polymatroid g ∈ Γ_{n−1} such that g(σ) = 0 for all σ ∈ Σ, and g(τ) > 0. Define f as follows:

 f(A) = g(A ∖ X_n) for all A ⊆ {X_1,…,X_n}   (13)

We claim that f ∈ Γ_n (i.e., f is a polymatroid). It then follows that Γ_n ⊭_EI Σ ⇒ τ because, by the assumption that var(Σ) and var(τ) are subsets of {X_1,…,X_{n−1}}, then f(σ) = g(σ) for all σ ∈ Σ ∪ {τ}. Hence, f(σ) = 0 for all σ ∈ Σ while f(τ) > 0.

We now prove the claim. First, by (13), we have that f(∅) = g(∅) = 0. We show that f is monotonic. So let A ⊆ B ⊆ {X_1,…,X_n}. If X_n ∉ B then X_n ∉ A, and we have that:

 f(B) − f(A) = g(B) − g(A) ≥ 0, since B ⊇ A and g ∈ Γ_{n−1}.

If X_n ∈ B ∖ A then we let B′ = B ∖ {X_n}, and we have:

 f(B′X_n) − f(A) = g(B′) − g(A) ≥ 0, by (13) and since B′ ⊇ A.

Finally, if X_n ∈ A, then by letting A′ = A ∖ {X_n}, B′ = B ∖ {X_n}, we have that:

 f(B′X_n) − f(A′X_n) = g(B′) − g(A′) ≥ 0, by (13) and since B′ ⊇ A′.

We now show that f is submodular. Let A, B ⊆ {X_1,…,X_n}. If X_n ∉ A ∪ B then f(C) = g(C) for every set C ∈ {A, B, A∪B, A∩B}. Since g is submodular, then f(A) + f(B) ≥ f(A∪B) + f(A∩B). If X_n ∈ A ∖ B then we write A = A′X_n and observe that, by (13): f(A) = g(A′), f(A∪B) = g(A′∪B), that f(A∩B) = g(A′∩B), and that f(B) = g(B). Hence: f(A) + f(B) = g(A′) + g(B) ≥ g(A′∪B) + g(A′∩B) = f(A∪B) + f(A∩B). The case where X_n ∈ B ∖ A is symmetrical. Finally, if X_n ∈ A ∩ B then X_n ∈ C for all C ∈ {A, B, A∪B, A∩B}. Hence, for every C in this set, we write C = C′X_n. In particular, by (13) we have that f(C) = g(C′), and the claim follows since g is submodular. ∎

###### Lemma 5.6.

Let . If then there exists a triple such that:

1. , and

2. and .

###### Proof.

Let τ = (A;B|C), where A = {a_1,…,a_m}, B = {b_1,…,b_ℓ}, and C = {c_1,…,c_k}. Following [DBLP:journals/iandc/GeigerPP91], we construct the parity distribution as follows. We let all the RVs, except a_1, be independent binary RVs with probability 1/2 for each of their two values, and let a_1 be determined from the rest as follows:

 a_1 = ∑_{i=2}^{m} a_i + ∑_{i=1}^{ℓ} b_i + ∑_{i=1}^{k} c_i (mod 2)   (14)

Let D ⊆ Ω and U = Ω ∖ ABC. We denote by d an assignment to the RVs in D, and by d_ABC the assignment d restricted to the RVs in D ∩ ABC. We show that if ABC ⊈ D then the RVs in D are pairwise independent. By the definition of the parity distribution we have that:

 P(D = d) = (1/2)^{|D ∩ U|} · P(D_ABC = d_ABC)

There are two cases with respect to . If then, by definition, , and overall we get that . Hence, the RVs in are pairwise independent. If , then since it holds that . To see this, observe that:

 P(a_1 = 1 | D_ABC ∖ {a_1}) = { 1/2  if ∑_{y ∈ D_ABC∖{a_1}} y (mod 2) = 0
                             { 1/2  if ∑_{y ∈ D_ABC∖{a_1}} y (mod 2) = 1

because if, w.l.o.g., , then implies that , and this is the case for precisely half of the assignments . Hence, for any such that it holds that , and therefore the RVs are pairwise independent.

By definition of entropy (see (4)), we have that H(X) = 1 for every binary RV X in the construction. Since the RVs in D are pairwise independent, then H(D) = |D| (this is due to the chain rule of entropy, and the fact that if X and Y are independent RVs then H(XY) = H(X) + H(Y)). Furthermore, for any (X;Y|Z) s.t. XYZ ⊉ ABC we have that:

 I(X;Y|Z) = H(XZ) + H(YZ) − H(Z) − H(XYZ)
          = |XZ| + |YZ| − |Z| − |XYZ|
          = |X| + |Y| + |Z| − |XYZ|
          = 0

On the other hand, letting A′ = A ∖ {a_1}, then by the chain rule for entropies, and noting that, by (14), H(a_1 | A′BC) = 0, then:

 H(var(τ)) = H(ABC) = H(a_1 A′BC)
           = H(a_1 | A′BC) + H(A′BC)
           = 0 + |ABC| − 1 = |ABC| − 1.

and thus

 I(A;B|C) = H(AC) + H(BC) − H(C) − H(ABC)
          = |AC| + |BC| − |C| − (|ABC| − 1)   (15)
          = 1

In other words, the parity distribution of (14) has an entropic function h, such that I_h(X;Y|Z) = 0 for every triple (X;Y|Z) where XYZ ⊉ ABC, while I_h(A;B|C) = 1. Hence, if Γ_n ⊨_EI Σ ⇒ τ, then there must be a triple σ ∈ Σ such that var(σ) ⊇ var(τ).
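The parity construction of (14) can be checked numerically. The sketch below (our own encoding, with the variables of τ mapped to bit positions) builds the distribution in which the first bit is the XOR of all the others, and verifies that the conditional mutual information vanishes whenever some variable of τ is dropped, while I(A;B|C) over all the bits equals 1:

```python
import math
from itertools import product

def parity_pmf(n):
    """Joint pmf of n binary RVs where bits 1..n-1 are i.i.d. fair coins
    and bit 0 is their sum mod 2, as in Eq. (14)."""
    return {(sum(rest) % 2,) + rest: 0.5 ** (n - 1)
            for rest in product((0, 1), repeat=n - 1)}

def cmi(pmf, X, Y, Z):
    """I(X;Y|Z) = H(XZ) + H(YZ) - H(Z) - H(XYZ), in bits."""
    def H(idx):
        marg = {}
        for outcome, p in pmf.items():
            key = tuple(outcome[i] for i in sorted(idx))
            marg[key] = marg.get(key, 0.0) + p
        return -sum(p * math.log2(p) for p in marg.values() if p > 0)
    return H(X | Z) + H(Y | Z) - H(Z) - H(X | Y | Z)

# tau = (a1; b1 | c1) encoded as bit positions 0, 1, 2:
# any triple that misses one of the three bits has zero
# mutual information, while conditioning on all of them gives 1.
pmf = parity_pmf(3)
```

Here `cmi(pmf, {0}, {1}, set())` and `cmi(pmf, {0}, {2}, set())` are both 0, while `cmi(pmf, {0}, {1}, {2})` is 1, mirroring the calculation in (15).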

Now, suppose that and that . In other words, . We denote and . Therefore, we can write as where and . It is easily shown that if or then . Otherwise (i.e., and ), then due to the properties of the parity function, we have that . Noting that , we get that .

Overall, we showed that for all triples that do not meet the conditions of the lemma, it holds that