 # Bayesian test of significance for conditional independence: The multinomial model

Conditional independence (CI) tests have received special attention lately in the Machine Learning and Computational Intelligence literature, as an important indicator of the relationships among the variables used by these models. In the field of Probabilistic Graphical Models (PGM)--which includes Bayesian Network (BN) models--CI tests are especially important for the task of learning the PGM structure from data. In this paper, we propose the Full Bayesian Significance Test (FBST) for tests of conditional independence on discrete datasets. The FBST is a powerful Bayesian test for precise hypotheses, and an alternative to frequentist significance tests (characterized by the calculation of the p-value).


## 1 Introduction

Barlow and Pereira (1990) discuss a graphical approach to conditional independence. A probabilistic influence diagram is a directed acyclic graph (DAG) that helps to model statistical problems. The graph is composed of a set of nodes or vertices, representing the variables, and a set of arcs joining the nodes, representing the dependence relationships shared by these variables.

The construction of the model helps to understand the problem and gives a good representation of the interdependence of the variables involved. The joint probability of these variables can be written as a product of conditional distributions, based on the relationships of independence and conditional independence among the variables involved in the problem.

Sometimes the interdependence of the variables is not known, and in this case, the model structure is required to be learnt from data. Algorithms such as the IC-Algorithm (Inferred Causation) described in Pearl and Verma (1995) are designed to uncover these structures from data. This algorithm uses a series of CI tests to remove and direct the arcs connecting the variables in the model, returning a DAG that minimally (with the minimum number of parameters, without loss of information) represents the variables in the problem.

The problem of learning DAG structures from data motivates the proposal of new powerful statistical tests for the hypothesis of conditional independence, since the accuracy of the structures learnt is directly affected by the errors committed by these tests. Recently proposed structure learning algorithms (see Cheng et al., 1997; Tsamardinos et al., 2006; Yehezkel and Lerner, 2009) identify the results of CI tests as their main source of errors.

In this paper, we propose the Full Bayesian Significance Test (FBST) for tests of conditional independence on discrete datasets. The FBST is a powerful Bayesian test for precise hypotheses, and can be used to learn DAG structures from data as an alternative to the CI tests currently used, such as Pearson's test.

This paper is organized as follows. In Section 2 we review the Full Bayesian Significance Test (FBST). In Section 3, we review the FBST for composite hypotheses. Section 4 shows an example of a test of conditional independence used to learn a simple model with 3 variables.

## 2 The Full Bayesian Significance Test

The Full Bayesian Significance Test (FBST) is presented by Pereira and Stern (1999) as a coherent Bayesian significance test for sharp hypotheses. In the FBST, the evidence for a precise hypothesis is computed.

This evidence is given by the complement of the probability of a credible set, called the tangent set, which is a subset of the parameter space where the posterior density of each of its elements is greater than the maximum of the posterior density over the Null hypothesis. A more formal definition is given below.

Consider a model in a statistical space described by the triple $(\Xi, \mathcal{A}, \Theta)$, where $\Xi$ is the sample space; $\mathcal{A}$, the family of measurable subsets of $\Xi$; and $\Theta$ the parameter space, a subset of $\mathbb{R}^n$.

Define a subset $T_\varphi$ of the parameter space (the tangent set), where the posterior density (denoted by $f_x$) of each element of this set is greater than $\varphi$:

$$T_\varphi = \{\theta \in \Theta \mid f_x(\theta) > \varphi\}$$

The credibility of $T_\varphi$ is given by its posterior probability:

$$\kappa = \int_{T_\varphi} f_x(\theta)\,d\theta = \int_{\Theta} f_x(\theta)\,\mathbb{1}_{T_\varphi}(\theta)\,d\theta$$

where $\mathbb{1}_{T_\varphi}$ is the indicator function:

$$\mathbb{1}_{T_\varphi}(\theta) = \begin{cases} 1 & \text{if } \theta \in T_\varphi \\ 0 & \text{otherwise} \end{cases}$$

Defining the maximum of the posterior density over the Null hypothesis $\Theta_0$ as $f^*_x$, with maximum point at $\theta^*_0$:

$$\theta^*_0 \in \arg\max_{\theta \in \Theta_0} f_x(\theta), \quad \text{and} \quad f^*_x = f_x(\theta^*_0)$$

the tangent set to the Null hypothesis is $T^* = T_{f^*_x}$, with credibility $\kappa^*$.

The measure of evidence for the Null hypothesis (called the e-value), which is the complement of the probability of the set $T^*$, is defined as:

$$\mathrm{Ev}(H_0) = 1 - \kappa^* = 1 - \int_{\Theta} f_x(\theta)\,\mathbb{1}_{T^*}(\theta)\,d\theta$$

If the probability of the set $T^*$ is large, the null set is in a region of low posterior density and the evidence is against the Null hypothesis $H_0$. But if the probability of $T^*$ is small, the null set is in a region of high posterior density, and the evidence supports the Null hypothesis.
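As an illustration, the e-value can be approximated numerically for a one-dimensional posterior. The sketch below is not part of the paper's method: it assumes, for the sake of example, a Beta(3, 3) posterior and a point Null hypothesis, and integrates the tangent set's probability on a grid:

```python
import math

def e_value_beta(a, b, theta0, grid=100_000):
    """Approximate Ev(H0: theta = theta0) for a Beta(a, b) posterior
    by grid integration of the tangent set's probability."""
    dens = lambda t: t ** (a - 1) * (1 - t) ** (b - 1)  # unnormalized density
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    f_star = dens(theta0)  # posterior density at the (point) Null hypothesis
    dt = 1.0 / grid
    # kappa = posterior probability of the tangent set {theta : f(theta) > f*}
    kappa = sum(dens((i + 0.5) * dt) * dt
                for i in range(grid)
                if dens((i + 0.5) * dt) > f_star) / norm
    return 1.0 - kappa

print(e_value_beta(3, 3, 0.5))  # H0 at the posterior mode: tangent set empty, Ev = 1.0
print(e_value_beta(3, 3, 0.2))  # H0 away from the mode: smaller evidence
```

A large e-value (near 1) means the null point sits in a high-density region; a small one means most of the posterior mass lies above the density attained on the hypothesis.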

### 2.1 FBST: Example of Tangent set

Figure 1 shows the tangent set for a Null hypothesis $H_0$, for the posterior distribution given below, where $\mu$ is the mean of a normal distribution and $\tau$ the precision (the inverse of the variance $\sigma^2$):

$$f_x(\mu, \tau) \propto \tau^{1.5}\, e^{-\tau \mu^2 - 1.5\tau}$$

## 3 FBST: Compositionality

The relationship between the credibility of a complex hypothesis $H$ and its elementary constituents $H_1, \ldots, H_k$, under the Full Bayesian Significance Test (FBST), is analysed in Borges and Stern (2007).

For a given set of independent parameters $\theta_1, \ldots, \theta_k$, a complex hypothesis $H$, such as:

$$H: \theta_1 \in \Theta_{H_1} \wedge \theta_2 \in \Theta_{H_2} \wedge \ldots \wedge \theta_k \in \Theta_{H_k}$$

where $\Theta_{H_j}$ is the subset of the parameter space of $\theta_j$ constrained by the hypothesis $H_j$, can be decomposed into its elementary components (hypotheses):

$$H_1: \theta_1 \in \Theta_{H_1} \qquad H_2: \theta_2 \in \Theta_{H_2} \qquad \cdots \qquad H_k: \theta_k \in \Theta_{H_k}$$

The credibility of $H$ can then be evaluated based on the credibility of these components. The evidence in favour of the complex hypothesis $H$ (measured by its e-value) cannot be obtained directly from the evidence in favour of the elementary components, but only from their truth functions (or cumulative surprise distributions), defined below.

For a given elementary component $H_j$ of the complex hypothesis $H$, $\theta^*_j$ is the point of maximum density of the posterior distribution $f_x$ constrained to the subset of the parameter space defined by the hypothesis $H_j$:

$$\theta^*_j \in \arg\max_{\theta_j \in \Theta_{H_j}} f_x(\theta_j) \quad \text{and} \quad f^*_j = f_x(\theta^*_j)$$

The truth function $W_j$ is the probability of the region of the parameter space where the posterior density is lower than or equal to a value $f$:

$$R_j(f) = \{\theta_j \in \Theta_j \mid f_x(\theta_j) \leq f\} \qquad W_j(f) = \int_{R_j(f)} f_x(\theta_j)\,d\theta_j$$

And the evidence supporting the hypothesis $H_j$ is:

$$\mathrm{Ev}(H_j) = W_j(f^*_j)$$

The evidence supporting the complex hypothesis $H$ can then be described in terms of the truth functions of its components, as the Mellin convolution of these functions evaluated at the product of the maxima:

$$\mathrm{Ev}(H) = W_1 \otimes W_2 \otimes W_3 \otimes \ldots \otimes W_k\,(f^*_1 \cdot f^*_2 \cdot f^*_3 \cdot \ldots \cdot f^*_k)$$

where the Mellin convolution of two truth functions, $W_1 \otimes W_2$, is the distribution function:

$$W_1 \otimes W_2(x) = \int_0^x W_1\!\left(\frac{x}{y}\right) dW_2(y)$$

### 3.1 Numerical Method for Convolution and Condensation

Williamson and Downs (1990) investigate numerical procedures to handle arithmetic operations on random variables. Replacing the basic arithmetic operations on fixed numbers by convolutions, they show how to calculate the joint distribution for a set of random variables and its respective upper and lower bounds.

The convolution for the multiplication of two random variables $X_1$ and $X_2$ ($Z = X_1 \cdot X_2$) can be written using their respective cumulative distribution functions $F_{X_1}$ and $F_{X_2}$:

$$F_Z(z) = \int_0^z F_{X_1}\!\left(\frac{z}{t}\right) dF_{X_2}(t)$$

The algorithm for the numerical calculation of the distribution of the product of two independent random variables ($X_1$ and $X_2$), using their discretized marginal probability distributions, is shown in Algorithm 1 (an algorithm for a discretization procedure is given in Williamson and Downs 1990, page 188).

The numerical convolution of two distributions with $n$ bins returns a distribution with $n^2$ bins. For a sequence of operations this would be a problem, since the result of each operation would be larger than the inputs of the next. The authors hence propose a simple method to reduce the size of the output back to $n$ bins without introducing error into the result. This operation is called condensation, and it returns the upper and lower bounds of each of the $n$ bins of the distribution resulting from the convolution. The algorithm for the condensation process is shown in Algorithm 2.
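The two procedures can be sketched as follows (a minimal illustration in the spirit of Algorithms 1 and 2, which are not reproduced here; the representation of a discretized distribution as a list of (value, probability) atoms is an assumption of this sketch):

```python
def product_convolve(px, py):
    # Algorithm 1 (sketch): distribution of the product of two independent
    # discretized random variables; returns len(px)*len(py) atoms sorted by value
    return sorted((x * y, p * q) for (x, p) in px for (y, q) in py)

def condense(atoms, n):
    # Algorithm 2 (sketch): group the sorted atoms into n bins; the lower bound
    # puts each bin's mass on its smallest value, the upper bound on its largest,
    # so the two results bracket the true distribution
    size = -(-len(atoms) // n)  # ceiling division
    lower, upper = [], []
    for i in range(0, len(atoms), size):
        chunk = atoms[i:i + size]
        mass = sum(p for _, p in chunk)
        lower.append((chunk[0][0], mass))
        upper.append((chunk[-1][0], mass))
    return lower, upper

# two uniform discrete distributions on {1, ..., 10}
px = [(v, 0.1) for v in range(1, 11)]
py = [(v, 0.1) for v in range(1, 11)]
z = product_convolve(px, py)   # 10^2 = 100 atoms
lo, hi = condense(z, 10)       # condensed back to 10 bins
mean = lambda d: sum(v * p for v, p in d)
print(mean(lo), mean(z), mean(hi))  # lower mean <= true mean <= upper mean
```

The condensation preserves total probability mass; only the location of the mass inside each bin is bounded rather than known exactly.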

#### 3.1.1 Vertical Condensation

Kaplan and Lin (1987) propose a vertical condensation procedure for discrete probability calculations, where the condensation is done using the vertical axis, instead of the horizontal axis, as in Williamson and Downs (1990).

The advantage of this approach is that it provides more control over the representation of the distribution: instead of selecting an interval of the domain of the cumulative distribution function (values assumed by the random variable) as a bin, we select the interval of the range of the cumulative distribution function in $[0, 1]$ that should be represented by each bin.

In this case, it is also possible to concentrate attention on a specific region of the distribution. For example, if there is a greater interest in the behaviour of the tail of the distribution, the size of the bins can be reduced in this region, consequently increasing the number of bins used to represent the tail of the distribution.

An example of convolution followed by a condensation procedure, using both approaches, is given in Section 3.2. We used, for this example, discretization and condensation procedures with bins uniformly distributed over both axes. At the end of the condensation procedure using the first approach, the bins are uniformly distributed horizontally (over the sample space of the variable). For the second approach, the bins of the cumulative probability distribution are uniformly distributed over the vertical axis in the interval $[0, 1]$. Algorithm 3 shows the condensation with bins uniformly distributed over the vertical axis.

### 3.2 Mellin Convolution: Example

An example of the Mellin convolution to find the product of two random variables $Y_1$ and $Y_2$, both with Log-normal distributions, is given below.

Assume $Y_1$ and $Y_2$ are continuous random variables such that:

$$Y_1 \sim \ln N(\mu_1, \sigma_1^2), \quad \text{and} \quad Y_2 \sim \ln N(\mu_2, \sigma_2^2)$$

We denote the cumulative distribution functions of $Y_1$ and $Y_2$ by $W_1$ and $W_2$, respectively, i.e.,

$$W_1(y_1) = \int_{-\infty}^{y_1} f_{Y_1}(t)\,dt, \quad \text{and} \quad W_2(y_2) = \int_{-\infty}^{y_2} f_{Y_2}(t)\,dt$$

where $f_{Y_1}$ and $f_{Y_2}$ are the density functions of $Y_1$ and $Y_2$, respectively. These distributions can be written as functions of two normally distributed random variables $X_1$ and $X_2$:

$$\ln(Y_1) = X_1 \sim N(\mu_1, \sigma_1^2) \qquad \ln(Y_2) = X_2 \sim N(\mu_2, \sigma_2^2)$$

And we can find the distribution of the product of these random variables ($Y_1 \cdot Y_2$), using simple arithmetic operations, to be also Log-normal:

$$Y_1 = e^{X_1} \text{ and } Y_2 = e^{X_2}$$

$$Y_1 \cdot Y_2 = e^{X_1 + X_2}$$

$$\ln(Y_1 \cdot Y_2) = X_1 + X_2 \sim N(\mu_1 + \mu_2,\, \sigma_1^2 + \sigma_2^2)$$

$$\therefore\; Y_1 \cdot Y_2 \sim \ln N(\mu_1 + \mu_2,\, \sigma_1^2 + \sigma_2^2)$$

The cumulative distribution function of $Y_1 \cdot Y_2$ ($W_{12}$) is defined as:

$$W_{12}(y_{12}) = \int_{-\infty}^{y_{12}} f_{Y_1 \cdot Y_2}(t)\,dt$$

where $f_{Y_1 \cdot Y_2}$ is the density function of $Y_1 \cdot Y_2$.

Figure 2 shows the cumulative distribution functions of $Y_1$ and $Y_2$ discretized with bins uniformly distributed over both the horizontal and vertical axes (horizontal and vertical discretizations). Figure 3 shows an example of convolution followed by condensation, using both the horizontal and vertical condensation procedures, together with the true distribution of the product of the two Log-normal variables.
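The closed-form result above is easy to check by simulation (a sketch; the parameters and sample size are arbitrary choices for this illustration):

```python
import math, random

random.seed(0)
mu1, s1 = 0.5, 0.4   # Y1 ~ lnN(0.5, 0.4^2)
mu2, s2 = 1.0, 0.3   # Y2 ~ lnN(1.0, 0.3^2)

# log of the product of independent Log-normal draws
logs = [math.log(random.lognormvariate(mu1, s1) * random.lognormvariate(mu2, s2))
        for _ in range(50_000)]

m = sum(logs) / len(logs)                       # sample mean of ln(Y1*Y2)
v = sum((x - m) ** 2 for x in logs) / len(logs) # sample variance of ln(Y1*Y2)
print(m, v)  # close to mu1 + mu2 = 1.5 and s1^2 + s2^2 = 0.25
```

The empirical mean and variance of $\ln(Y_1 \cdot Y_2)$ match $\mu_1 + \mu_2$ and $\sigma_1^2 + \sigma_2^2$, confirming that the product is Log-normal with those parameters.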

## 4 Test of Conditional Independence in Contingency table using FBST

We now apply the methods shown in the previous sections to find the evidence of a complex Null hypothesis of conditional independence, for discrete variables.

Given the discrete random variables $X$, $Y$ and $Z$, with $X$ taking values in $\{1, \ldots, k\}$, the test of conditional independence can be written as the complex Null hypothesis $H$:

$$H: [Y \perp\!\!\!\perp Z \mid X=1] \wedge [Y \perp\!\!\!\perp Z \mid X=2] \wedge \cdots \wedge [Y \perp\!\!\!\perp Z \mid X=k]$$

The hypothesis $H$ can be decomposed into its elementary components:

$$H_1: Y \perp\!\!\!\perp Z \mid X=1 \qquad H_2: Y \perp\!\!\!\perp Z \mid X=2 \qquad \cdots \qquad H_k: Y \perp\!\!\!\perp Z \mid X=k$$

Notice that the hypotheses are independent: for each value $x$ taken by $X$, the values taken by the variables $Y$ and $Z$ are assumed to be random observations drawn from some distribution $P(Y, Z \mid X = x)$. Each of the elementary components is a hypothesis of independence in a contingency table. Table 1 shows the contingency table for $Y$ and $Z$ taking values, respectively, in $\{1, \ldots, r\}$ and $\{1, \ldots, c\}$.

The test of the hypothesis $H_x$ can be set up using the multinomial distribution for the cell counts of the contingency table and its natural conjugate prior, the Dirichlet distribution, for the vector of parameters $\theta_x$.

For a given array of hyperparameters $\alpha_x$, the Dirichlet distribution is defined as:

$$f(\theta_x \mid \alpha_x) = \Gamma\!\Big(\sum_{y,z}^{r,c} \alpha_{yzx}\Big) \prod_{y,z}^{r,c} \frac{\theta_{yzx}^{\alpha_{yzx}-1}}{\Gamma(\alpha_{yzx})} \tag{1}$$

The multinomial likelihood, for the given contingency table, assuming the array of observations $n_x$ and the total count $n_{\cdot\cdot x}$, is:

$$f(n_x \mid \theta_x) = n_{\cdot\cdot x}! \prod_{y,z}^{r,c} \frac{\theta_{yzx}^{n_{yzx}}}{n_{yzx}!} \tag{2}$$

The posterior distribution will then be a Dirichlet distribution:

$$f_n(\theta_x) \propto \prod_{y,z}^{r,c} \theta_{yzx}^{\alpha_{yzx} + n_{yzx} - 1} \tag{3}$$

Under the hypothesis $H_x$ we have $Y \perp\!\!\!\perp Z \mid X = x$. In this case, the joint distribution is equal to the product of the marginals: $P(Y, Z \mid X=x) = P(Y \mid X=x) \cdot P(Z \mid X=x)$. We can state this condition using the array of parameters $\theta_x$:

$$H_x: \theta_{yzx} = \theta_{\cdot z x} \cdot \theta_{y \cdot x}, \quad \forall\, y, z \tag{4}$$

where $\theta_{y \cdot x} = \sum_z \theta_{yzx}$ and $\theta_{\cdot z x} = \sum_y \theta_{yzx}$.

The point of maximum density of the posterior distribution constrained to the subset of the parameter space defined by the hypothesis $H_x$ can be estimated using the maximum a posteriori (MAP) estimator under the hypothesis (the mode of the parameters $\theta_x$). The maximum density ($f^*_x$) will be the posterior density evaluated at this point:

$$\theta^*_{yzx} = \frac{n^{H_x}_{yzx} + \alpha_{yzx} - 1}{n^{H_x}_{\cdot\cdot x} + \alpha_{\cdot\cdot x} - r \cdot c} \quad \text{and} \quad f^*_x = f_n(\theta^*_x) \tag{5}$$

where $n^{H_x}_{yzx} = n_{y \cdot x} \cdot n_{\cdot z x} / n_{\cdot\cdot x}$ are the cell counts expected under the hypothesis $H_x$, and $\alpha_{\cdot\cdot x} = \sum_{y,z} \alpha_{yzx}$.

The evidence supporting $H_x$ can be written in terms of the truth function $W_x$, as defined in Section 3:

$$R_x(f) = \{\theta_x \in \Theta_x \mid f_n(\theta_x) \leq f\} \tag{6}$$

$$W_x(f) = \int_{R_x(f)} f_n(\theta_x)\,d\theta_x \propto \int_{R_x(f)} \prod_{y,z}^{r,c} \theta_{yzx}^{\alpha_{yzx} + n_{yzx} - 1}\,d\theta_x \tag{7}$$

And the evidence supporting the hypothesis $H_x$ is:

$$\mathrm{Ev}(H_x) = W_x(f^*_x) \tag{8}$$

Finally, the evidence supporting the hypothesis of conditional independence ($H$) will be given by the convolution of the truth functions, evaluated at the product of the points of maximum posterior density of each component of the hypothesis $H$:

$$\mathrm{Ev}(H) = W_1 \otimes W_2 \otimes \ldots \otimes W_k\,(f^*_1 \cdot f^*_2 \cdot \ldots \cdot f^*_k) \tag{9}$$

The e-value for the hypothesis $H$ can be found using numerical methods of integration. An example is given in the next section, where the numerical convolution followed by the condensation procedures described in Section 3.1 is used. The method of horizontal condensation results in an interval for the e-value (found using the lower and upper bounds resulting from the condensation process), while the vertical procedure results in a single value.
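As a sanity check on the construction, the e-value of a single component $\mathrm{Ev}(H_x)$ can also be approximated by posterior simulation instead of the convolution machinery: sample from the Dirichlet posterior and count the mass with density at most $f^*_x$. The sketch below assumes uniform priors (all $\alpha = 1$), in which case the MAP under independence reduces to the product of the marginal relative frequencies; this simplification is an assumption of the illustration:

```python
import math, random

def e_value_independence(table, samples=20_000, seed=1):
    """Monte Carlo Ev(H: rows independent of columns) for a two-way
    contingency table, with a uniform Dirichlet prior (all alphas = 1)."""
    random.seed(seed)
    n = [c for row in table for c in row]
    total = sum(n)
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    # MAP under independence: theta* = (row sum * column sum) / total^2
    theta_star = [rows[y] * cols[z] / total ** 2
                  for y in range(len(rows)) for z in range(len(cols))]
    log_f_star = sum(c * math.log(t) for c, t in zip(n, theta_star))
    below = 0
    for _ in range(samples):
        g = [random.gammavariate(c + 1, 1.0) for c in n]  # Dirichlet(n + 1) draw
        s = sum(g)
        log_f = sum(c * math.log(x / s) for c, x in zip(n, g))
        below += log_f <= log_f_star  # mass outside the tangent set
    return below / samples

print(e_value_independence([[4, 8], [6, 12]]))   # counts factorize exactly: Ev = 1.0
print(e_value_independence([[10, 0], [0, 10]]))  # strong dependence: Ev near 0
```

The log-density comparison avoids floating-point underflow of the product $\prod \theta^{n}$ for large counts.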

### 4.1 Example of CI test using FBST

In this section we describe an example of a CI test using the Full Bayesian Significance Test (FBST), applied to samples from two different models. For both models, we test whether the variable $Y$ is conditionally independent of $Z$ given $X$.

The two probabilistic graphical models ($M_1$ and $M_2$) are shown in Figure 4, where all three variables $X$, $Y$ and $Z$ assume values in $\{1, 2, 3\}$: in the first model, the hypothesis of conditional independence is true, while in the second model the same hypothesis is false. The synthetic conditional probability tables (CPTs) used to generate the samples are given in Appendix A.

We calculate the intervals for the e-values, and compare them, for the hypothesis $H$ of conditional independence, for both models $M_1$ and $M_2$. The complex hypothesis $H$ can be decomposed into its elementary components:

$$H_1: Y \perp\!\!\!\perp Z \mid X=1 \qquad H_2: Y \perp\!\!\!\perp Z \mid X=2 \qquad H_3: Y \perp\!\!\!\perp Z \mid X=3$$

For each model, random observations have been generated; the contingency tables of $Y$ and $Z$ for each value of $X$ are shown in Table 2. The hyperparameters of the prior distribution were all set to 1, so the prior is equivalent to a uniform distribution (from Equation 1):

$$\alpha_1 = \alpha_2 = \alpha_3 = [1, \ldots, 1] \qquad f(\theta_1 \mid \alpha_1) = f(\theta_2 \mid \alpha_2) = f(\theta_3 \mid \alpha_3) \propto 1$$

The posterior distributions, found using Equations 2 and 3, are then:

$$f_n(\theta_1) \propto \prod_{y,z=1}^{3,3} \theta_{yz1}^{n_{yz1}}, \quad f_n(\theta_2) \propto \prod_{y,z=1}^{3,3} \theta_{yz2}^{n_{yz2}}, \quad f_n(\theta_3) \propto \prod_{y,z=1}^{3,3} \theta_{yz3}^{n_{yz3}}$$

For example, for the given contingency table for Model $M_1$, when $X = 2$ (Table 2), the posterior distribution is:

$$f_n(\theta_2) \propto \theta_{112}^{42} \cdot \theta_{122}^{41} \cdot \theta_{132}^{323} \cdot \theta_{212}^{39} \cdot \theta_{222}^{41} \cdot \theta_{232}^{341} \cdot \theta_{312}^{15} \cdot \theta_{322}^{21} \cdot \theta_{332}^{171}$$

And the point of highest density under the hypothesis of independence (Equations 4 and 5) was found, for this example, to be:

$$\theta^*_2 \approx [0.036, 0.039, 0.317, 0.038, 0.041, 0.329, 0.019, 0.020, 0.162]$$
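This MAP can be reproduced directly from the cell counts read off the exponents of the posterior above: with all $\alpha = 1$, Equation 5 reduces to the product of the marginal relative frequencies. A short check (the 3x3 layout of the counts, rows indexed by $y$ and columns by $z$, is an assumption consistent with the exponents):

```python
counts = [[42, 41, 323],   # n_{yz2} for Model M1, X = 2
          [39, 41, 341],
          [15, 21, 171]]

total = sum(sum(row) for row in counts)       # n_{..2}
rows = [sum(row) for row in counts]           # n_{y.2}
cols = [sum(col) for col in zip(*counts)]     # n_{.z2}

# Equation 5 with alpha = 1: theta* = (row sum * column sum) / total^2
theta_star = [round(rows[y] * cols[z] / total ** 2, 3)
              for y in range(3) for z in range(3)]
print(theta_star)
# [0.036, 0.039, 0.317, 0.038, 0.041, 0.329, 0.019, 0.02, 0.162]
```

The rounded values match the vector reported above, confirming the reading of the posterior exponents.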

The truth function and the evidence supporting the hypothesis of independence given $X = 2$ (hypothesis $H_2$) for model $M_1$, as given in Equations 6 and 8, are:

$$R_2(f) = \{\theta_2 \in \Theta_2 \mid f_n(\theta_2) \leq f\} \qquad W_2(f) = \int_{R_2(f)} f_n(\theta_2)\,d\theta_2 \qquad \mathrm{Ev}_{M_1}(H_2) = W_2(f_n(\theta^*_2))$$

We used methods of numerical integration to find the e-values of the elementary components of the hypothesis $H$ ($H_1$, $H_2$ and $H_3$); the results for each model are given below.

E-values found using horizontal discretization:

$$\mathrm{Ev}_{M_1}(H_1) = 0.9878, \quad \mathrm{Ev}_{M_1}(H_2) = 0.9806, \quad \mathrm{Ev}_{M_1}(H_3) = 0.1066$$

$$\mathrm{Ev}_{M_2}(H_1) = 0.0004, \quad \mathrm{Ev}_{M_2}(H_2) = 0.0006, \quad \mathrm{Ev}_{M_2}(H_3) = 0.0004$$

And the e-values found using vertical discretization:

$$\mathrm{Ev}_{M_1}(H_1) = 0.99, \quad \mathrm{Ev}_{M_1}(H_2) = 0.98, \quad \mathrm{Ev}_{M_1}(H_3) = 0.11$$

$$\mathrm{Ev}_{M_2}(H_1) = 0.01, \quad \mathrm{Ev}_{M_2}(H_2) = 0.01, \quad \mathrm{Ev}_{M_2}(H_3) = 0.01$$

Figure 5 shows the histograms of the truth functions $W_1$, $W_2$ and $W_3$ for Model $M_1$ ($Y$ and $Z$ are conditionally independent given $X$). In the panels using the first approach, the bins are uniformly distributed over the horizontal axis (over the empirical range of the posterior density values); in the panels using the second approach, the bins are uniformly distributed over the vertical axis (each bin represents an equal increase in cumulative density over the previous bin). Notice that the truth functions evaluated at the maximum posterior density over the respective hypotheses $H_x$, marked in red, correspond to the e-values found (e.g., $W_2(f^*_2) \approx 0.9806$ for the horizontal discretization in Figure 5).

The evidence supporting the hypothesis of conditional independence $H$, as in Equation 9, for each model, will be:

$$\mathrm{Ev}(H) = W_1 \otimes W_2 \otimes W_3\,(f_n(\theta^*_1) \cdot f_n(\theta^*_2) \cdot f_n(\theta^*_3))$$

The convolution is commutative, therefore the order of the convolutions is irrelevant:

$$W_1 \otimes W_2 \otimes W_3\,(f) = W_3 \otimes W_2 \otimes W_1\,(f)$$

Using the algorithm for numerical convolution described in Algorithm 1, we found the convolution of the truth functions $W_1$ and $W_2$, resulting in a cumulative function ($W_{12}$) with $100^2$ bins. We then performed the condensation procedures described in Algorithms 2 and 3, reducing the cumulative distribution to 100 bins, with lower and upper bounds ($\underline{W}$ and $\overline{W}$) for the horizontal condensation. The results are shown in Figure 6 for Model $M_1$ (horizontal and vertical condensations, respectively), and in Figure 7 for Model $M_2$.

The convolution of $W_{12}$ and $W_3$ was then performed, followed by condensation. The results are shown in Figures 6 (Model $M_1$) and 7 (Model $M_2$).

The e-values supporting the hypothesis of conditional independence for both models are given below.

The intervals for the e-values found using horizontal discretization and condensation were:

$$\mathrm{Ev}_{M_1}(H) = [0.587427,\, 0.718561] \qquad \mathrm{Ev}_{M_2}(H) = [8 \cdot 10^{-12},\, 6.416 \cdot 10^{-9}]$$

And the e-values found using vertical discretization and condensation were:

$$\mathrm{Ev}_{M_1}(H) = 0.95 \qquad \mathrm{Ev}_{M_2}(H) = 0.01$$

These results show strong evidence supporting the hypothesis of conditional independence between $Y$ and $Z$ given $X$ for model $M_1$ (using both discretization/condensation procedures), and no evidence supporting the same hypothesis for the second model. This result is relevant and promising as a motivation for further studies of the use of the FBST as a CI test for the structure learning of graphical models.

## 5 Conclusion and Future Work

This paper gives the framework to perform tests of conditional independence for discrete datasets using the Full Bayesian Significance Test (FBST). A simple example of the application of this test to learn the structure of a directed acyclic graph is given, using two different models. The results found in this paper suggest that the FBST should be considered a good alternative for performing CI tests in the task of learning structures of probabilistic graphical models from data.

Future research includes the use of the FBST in an algorithm to learn structures of graphs with a larger number of variables; improving the performance of the numerical methods used to calculate the e-values (as learning DAG structures from data requires an exponential number of CI tests, each CI test needs to be performed quickly); and an empirical evaluation of the threshold on e-values used to decide between conditional independence and dependence, by minimizing a linear combination of errors of types I and II (incorrect rejection of a true hypothesis of conditional independence, and failure to reject a false one).

## References

• Barlow and Pereira (1990) Barlow, R. E., & Pereira, C.A.B. (1990). Conditional independence and probabilistic influence diagrams. Lecture Notes-Monograph Series, pp. 19–33.
• Basu and Pereira (2011) Basu, D., & Pereira, C.A.B. (2011). Conditional independence in statistics. In Selected Works of Debabrata Basu, pp. 371–384. Springer New York.
• Borges and Stern (2007) Borges, W., & Stern, J. M. (2007). The rules of logic composition for the Bayesian epistemic e-values. Logic Journal of IGPL, 15(5-6), pp. 401–420.
• Cheng et al. (1997) Cheng, J., Bell, D. A., & Liu, W. (1997, January). Learning belief networks from data: An information theory based approach. In Proceedings of the sixth international conference on Information and knowledge management, pp. 325–331. ACM.
• Kaplan and Lin (1987) Kaplan, S., & Lin, J. C. (1987). An improved condensation procedure in discrete probability distribution calculations. Risk Analysis, 7(1), 15–19.
• Pereira and Stern (1999) Pereira, C.A.B., & Stern, J.M. (1999). Evidence and Credibility: Full Bayesian Significance Test for Precise Hypotheses. Entropy, 1, pp. 99–110.
• Pearl and Verma (1995) Pearl, J., & Verma, T. S. (1995). A theory of inferred causation. Studies in Logic and the Foundations of Mathematics, 134, pp. 789–811.
• Tsamardinos et al. (2006) Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1), pp. 31–78.
• Williamson (1989) Williamson, R. C. (1989). Probabilistic arithmetic (Doctoral dissertation, University of Queensland).
• Williamson and Downs (1990) Williamson, R. C., & Downs, T. (1990). Probabilistic arithmetic. I. Numerical methods for calculating convolutions and dependency bounds. International Journal of Approximate Reasoning, 4(2), pp. 89–158.
• Yehezkel and Lerner (2009) Yehezkel, R., & Lerner, B. (2009). Bayesian network structure learning by recursive autonomy identification. The Journal of Machine Learning Research, 10, pp. 1527–1570.