Bayesian test of significance for conditional independence: The multinomial model

06/16/2013 ∙ by Pablo de Morais Andrade, et al. ∙ Universidade de São Paulo

Conditional independence tests (CI tests) have received special attention lately in the Machine Learning and Computational Intelligence literature as an important indicator of the relationships among the variables used by their models. In the field of Probabilistic Graphical Models (PGM)--which includes Bayesian Network (BN) models--CI tests are especially important for the task of learning the PGM structure from data. In this paper, we propose the Full Bayesian Significance Test (FBST) for tests of conditional independence for discrete datasets. FBST is a powerful Bayesian test for precise hypotheses, offered as an alternative to frequentist significance tests (characterized by the calculation of the p-value).


1 Introduction

Barlow and Pereira (1990) discuss a graphical approach to conditional independence. A probabilistic influence diagram is a directed acyclic graph (DAG) that helps to model statistical problems. The graph is composed of a set of nodes or vertices, representing the variables, and a set of arcs joining the nodes, representing the dependence relationships shared by these variables.

The construction of the model helps to understand the problem and gives a good representation of the interdependence of the variables involved. The joint probability of these variables can be written as a product of conditional distributions, based on the relationships of independence and conditional independence among the variables involved in the problem.

Sometimes the interdependence of the variables is not known, and in this case the model structure must be learnt from data. Algorithms such as the IC algorithm (Inferred Causation) described in Pearl and Verma (1995) are designed to uncover these structures from data. This algorithm uses a series of CI tests to remove and direct the arcs connecting the variables in the model, returning a DAG that minimally (with the minimum number of parameters, without loss of information) represents the variables in the problem.

The problem of learning DAG structures from data motivates the proposal of new, powerful statistical tests for the hypothesis of conditional independence, since the accuracy of the structures learnt is directly affected by errors committed by these tests. Recently proposed structure learning algorithms (see Cheng et al., 1997; Tsamardinos et al., 2006; Yehezkel and Lerner, 2009) identify the results of CI tests as their main source of errors.

In this paper, we propose the Full Bayesian Significance Test (FBST) for tests of conditional independence for discrete datasets. FBST is a powerful Bayesian test for precise hypotheses, and can be used to learn DAG structures from data as an alternative to the CI tests currently used, such as Pearson's chi-squared test.

This paper is organized as follows. In Section 2, we review the Full Bayesian Significance Test (FBST). In Section 3, we review the FBST for composite hypotheses. Section 4 shows an example of a test of conditional independence used to learn a simple model with 3 variables.

2 The Full Bayesian Significance Test

The Full Bayesian Significance Test (FBST) is presented by Pereira and Stern (1999) as a coherent Bayesian significance test for sharp hypotheses. In the FBST, the evidence in favour of a precise hypothesis is computed.

This evidence is given by the complement of the probability of a credible set, called the tangent set, which is a subset of the parameter space where the posterior density of each of its elements is greater than the maximum of the posterior density over the Null hypothesis. A more formal definition is given below.

Consider a model in a statistical space described by the triple $(\Omega, \mathcal{F}, \Theta)$, where $\Omega$ is the sample space; $\mathcal{F}$, the family of measurable subsets of $\Omega$; and $\Theta \subseteq \mathbb{R}^p$, the parameter space.

Define the tangent set $T_\varphi = \{\theta \in \Theta : f(\theta \mid x) > \varphi\}$, a subset of the parameter space where the posterior density (denoted by $f(\theta \mid x)$) of each element of this set is greater than $\varphi$.

The credibility of $T_\varphi$ is given by its posterior probability
$$\kappa = \int_{T_\varphi} f(\theta \mid x)\, d\theta = \int_{\Theta} \mathbb{1}_{T_\varphi}(\theta)\, f(\theta \mid x)\, d\theta,$$
where $\mathbb{1}_{T_\varphi}$ is the indicator function of $T_\varphi$.

Define the maximum of the posterior density over the Null hypothesis as
$$f^* = f(\theta^* \mid x) = \sup_{\theta \in \Theta_0} f(\theta \mid x),$$
with maximum point $\theta^*$, and define the tangent set to the Null hypothesis as $T^* = \{\theta \in \Theta : f(\theta \mid x) > f^*\}$. The credibility of $T^*$ is $\kappa^* = \int_{T^*} f(\theta \mid x)\, d\theta$.

The measure of evidence in favour of the Null hypothesis (called the e-value), which is the complement of the probability of the set $T^*$, is defined as
$$\mathrm{ev}(H_0) = 1 - \kappa^*.$$

If the probability of the set $T^*$ is large, the null set is in a region of low posterior density and the evidence is against the Null hypothesis $H_0$. But if the probability of $T^*$ is small, then the null set is in a region of high posterior density, and the evidence supports the Null hypothesis.
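To make the definition concrete, here is a minimal numerical sketch (our addition, not from the paper) for a one-parameter problem; the Beta(6, 4) posterior and the point hypothesis $H_0: \theta = 0.5$ are illustrative choices:

```python
import numpy as np
from scipy.stats import beta

posterior = beta(6, 4)       # e.g., uniform prior + 5 successes, 3 failures
f_star = posterior.pdf(0.5)  # max posterior density over H0: theta = 0.5

# Credibility of the tangent set T* = {theta : f(theta|x) > f*} by quadrature.
grid = np.linspace(0.0, 1.0, 100_001)
dens = posterior.pdf(grid)
dx = grid[1] - grid[0]
kappa = dens[dens > f_star].sum() * dx

print(f"ev(H0) = {1.0 - kappa:.4f}")  # e-value: evidence supporting H0
```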

2.1 FBST: Example of Tangent set

Figure 1 shows the tangent set for a Null hypothesis $H_0$, for the posterior distribution of the parameters of a normal distribution: its mean $\mu$ and its precision $\phi$ (the inverse of the variance, $\phi = 1/\sigma^2$).

3 FBST: Compositionality

The relationship between the credibility of a complex hypothesis $H$ and that of its elementary constituents $H_j$, $j = 1, \ldots, k$, under the Full Bayesian Significance Test (FBST) is analysed in Borges and Stern (2007).

For a given set of independent parameters $\theta_1, \theta_2, \ldots, \theta_k$, a complex hypothesis $H$, such as
$$H: \theta_1 \in \Theta_1^H \wedge \theta_2 \in \Theta_2^H \wedge \ldots \wedge \theta_k \in \Theta_k^H,$$
where $\Theta_j^H$ is the subset of the parameter space for $\theta_j$ constrained by the hypothesis $H$, can be decomposed into its elementary components (hypotheses) $H_j: \theta_j \in \Theta_j^H$, and the credibility of $H$ can be evaluated based on the credibility of these components. The evidence in favour of the complex hypothesis $H$ (measured by its e-value) cannot be obtained directly from the evidence in favour of the elementary components, but it can be obtained from their truth functions (or cumulative surprise distributions), defined below.

Figure 1: Example of a tangent set for a Null hypothesis $H_0$. Panels (a) and (b) show the posterior distribution, with the red line representing the points in the Null hypothesis. Panel (c) shows the contours of the posterior: the points of maximum density in the Null hypothesis have density 0.1037. The tangent set of the Null hypothesis is the set of points inside the green contour line (points with density greater than 0.1037), and the e-value of $H_0$ is the complement of the integral of the posterior density over the region bounded by the green contour line.

For a given elementary component $H_j$ of the complex hypothesis $H$, let $\theta_j^*$ be the point of maximum density of the posterior distribution $f_j(\theta_j \mid x)$ constrained to the subset of the parameter space defined by the hypothesis $H_j$:
$$f_j^* = f_j(\theta_j^* \mid x) = \sup_{\theta_j \in \Theta_j^H} f_j(\theta_j \mid x).$$

The truth function $W_j$ is the posterior probability of the region of the parameter space where the posterior density is lower than or equal to a value $v$:
$$W_j(v) = \int_{\{\theta_j \,:\, f_j(\theta_j \mid x) \le v\}} f_j(\theta_j \mid x)\, d\theta_j.$$

And the evidence supporting the hypothesis $H_j$ is
$$\mathrm{ev}(H_j) = W_j(f_j^*).$$

The evidence supporting the complex hypothesis $H$ can then be described in terms of the truth functions of its components, as the Mellin convolution of these functions evaluated at the product of the constrained maxima:
$$\mathrm{ev}(H) = (W_1 \otimes W_2 \otimes \cdots \otimes W_k)(f_1^* f_2^* \cdots f_k^*),$$
where the Mellin convolution of two truth functions, $W_1 \otimes W_2$, is the distribution function
$$(W_1 \otimes W_2)(v) = \int_0^\infty W_1\!\left(\frac{v}{y}\right) dW_2(y).$$

3.1 Numerical Method for Convolution and Condensation

Williamson and Downs (1990) investigate numerical procedures to handle arithmetic operations on random variables. Replacing the basic operations of arithmetic, used for fixed numbers, by convolutions, they show how to calculate the distribution of a combination of random variables together with its respective upper and lower bounds.

The convolution for the multiplication of two random variables $X$ and $Y$ ($Z = XY$, with $X, Y > 0$) can be written using their respective cumulative distribution functions $F_X$ and $F_Y$:
$$F_Z(z) = \int_0^\infty F_X\!\left(\frac{z}{y}\right) dF_Y(y).$$
The algorithm for the numerical calculation of the distribution of the product of two independent random variables ($X$ and $Y$), using their discretized marginal probability distributions ($p$ and $q$), is shown in Algorithm 1 (an algorithm for a discretization procedure is given in Williamson and Downs, 1990, page 188).

The numerical convolution of two distributions with $n$ bins returns a distribution with $n^2$ bins. For a sequence of operations this would be a problem, since the result of each operation would be larger than its inputs. The authors hence propose a simple method to reduce the size of the output back to $n$ bins without introducing error into the result. This operation is called condensation, and it returns the upper and lower bounds of each of the $n$ bins of the distribution resulting from the convolution. The algorithm for the condensation process is shown in Algorithm 2.

procedure Convolution(p, q)                ▷ discrete p.d.f.s of X and Y, each with n bins at points x_i and y_j
    for i = 1 to n do
        for j = 1 to n do
            z_[(i-1)n+j] ← x_i · y_j       ▷ product of bin positions
            r_[(i-1)n+j] ← p_i · q_j       ▷ joint mass under independence
        end for
    end for
    sort the pairs (z_k, r_k) in ascending order of z_k
    F_0 ← 0
    for k = 1 to n² do                     ▷ find the c.d.f. of Z = X · Y
        F_k ← F_(k-1) + r_k
    end for
    return (z, F)                          ▷ discrete c.d.f. of Z with n² bins
end procedure
Algorithm 1: Find the distribution of the product of two random variables.
procedure HorizontalCondensation(z, F)     ▷ histogram of a c.d.f. with n² bins
    m ← n²/n                               ▷ fine bins per condensed bin
    for k = 1 to n do
        F_lower[k] ← F_((k-1)m)            ▷ lower bound after condensation (F_0 = 0)
        F_upper[k] ← F_(km)                ▷ upper bound after condensation
    end for
    return (F_lower, F_upper)              ▷ histograms with upper/lower bounds
end procedure
Algorithm 2: Find upper and lower bounds of a c.d.f. for condensation.
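For readers who prefer running code, here is a compact Python sketch of Algorithms 1 and 2; it is our reconstruction (function and variable names are ours), assuming each marginal is discretized to n bins carrying a representative point and a probability mass:

```python
import numpy as np

def product_convolution(x, p, y, q):
    """Algorithm 1 (sketch): discrete c.d.f. of Z = X * Y.

    x, y -- representative points of the n bins of X and Y
    p, q -- probability mass of each bin
    Returns the n^2 sorted product points and the c.d.f. at those points.
    """
    z = np.multiply.outer(x, y).ravel()   # all n^2 products x_i * y_j
    r = np.multiply.outer(p, q).ravel()   # joint mass p_i * q_j (independence)
    order = np.argsort(z)
    z, r = z[order], r[order]
    return z, np.cumsum(r)                # running total = discrete c.d.f.

def horizontal_condensation(z, F, n):
    """Algorithm 2 (sketch): reduce an n^2-bin c.d.f. to n bins.

    Each output bin covers n consecutive fine bins; the c.d.f. value at the
    group's left edge bounds it from below, the value at the right edge from
    above, so the true c.d.f. always lies between the returned bounds.
    """
    m = len(F) // n                              # fine bins per condensed bin
    upper = F[m - 1::m]                          # c.d.f. at each right edge
    lower = np.concatenate(([0.0], upper[:-1]))  # c.d.f. before each group
    return z[m - 1::m], lower, upper             # edges, lower, upper bounds
```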

3.1.1 Vertical Condensation

Kaplan and Lin (1987) propose a vertical condensation procedure for discrete probability calculations, where the condensation is done using the vertical axis, instead of the horizontal axis, as in Williamson and Downs (1990).

The advantage of this approach is that it provides more control over the representation of the distribution: instead of selecting an interval of the domain of the cumulative distribution function (values assumed by the random variable) as a bin, we select the interval of the range of the cumulative distribution in $[0, 1]$ that should be represented by each bin.

It also makes it possible to concentrate attention on a specific region of the distribution. For example, if there is greater interest in the behaviour of the tail of the distribution, the size of the bins can be reduced in this region, consequently increasing the number of bins used to represent the tail of the distribution.

An example of a convolution followed by a condensation procedure, using both approaches, is given in Section 3.2. We used, for this example, discretization and condensation procedures with bins uniformly distributed over both axes. At the end of the condensation procedure using the first approach, the bins are uniformly distributed horizontally (over the sample space of the variable). For the second approach, the bins of the cumulative probability distribution are uniformly distributed over the vertical axis in the interval $[0, 1]$. Algorithm 3 shows the condensation with bins uniformly distributed over the vertical axis.

procedure VerticalCondensation(z, F, n)    ▷ histogram of a c.d.f. with n² fine bins; n output bins
    b ← (1/n, 2/n, …, 1)                   ▷ uniform breaks on the vertical (probability) axis
    k ← 1
    for i = 1 to n² do                     ▷ walk through the fine bins
        if F_i ≥ b_k then                  ▷ break b_k falls within the current fine bin
            record z_i as the right edge of output bin k
            k ← k + 1
        end if
    end for
    return the n output bins               ▷ c.d.f. bins uniform in probability
end procedure
Algorithm 3: Condensation with bins uniformly distributed over the vertical axis.
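And a matching Python sketch of the vertical condensation, again our reconstruction under the same assumptions: rather than grouping a fixed number of fine bins, it cuts the c.d.f. at the uniform probability levels 1/n, ..., 1 and records where each level is first reached:

```python
import numpy as np

def vertical_condensation(z, F, n):
    """Algorithm 3 (sketch): condense a c.d.f. using bins that are uniform
    on the vertical (probability) axis; returns a discrete quantile function."""
    levels = np.arange(1, n + 1) / n      # uniform breaks in (0, 1]
    idx = np.searchsorted(F, levels)      # first fine bin reaching each level
    idx = np.minimum(idx, len(z) - 1)     # guard against float rounding at 1.0
    return z[idx], levels
```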

3.2 Mellin Convolution: Example

An example of the Mellin convolution used to find the distribution of the product of two random variables $X$ and $Y$, both with a Log-normal distribution, is given below.



Assume $X$ and $Y$ are continuous random variables such that $X \sim \mathrm{Lognormal}(\mu_X, \sigma_X^2)$ and $Y \sim \mathrm{Lognormal}(\mu_Y, \sigma_Y^2)$. We denote the cumulative distributions of $X$ and $Y$ by $F_X$ and $F_Y$, respectively, i.e., $F_X(x) = \int_0^x f_X(t)\, dt$ and $F_Y(y) = \int_0^y f_Y(t)\, dt$, where $f_X$ and $f_Y$ are the density functions of $X$ and $Y$. These distributions can be written as functions of two normally distributed random variables $W_1 \sim N(\mu_X, \sigma_X^2)$ and $W_2 \sim N(\mu_Y, \sigma_Y^2)$:
$$X = e^{W_1}, \qquad Y = e^{W_2}.$$

And we can find the distribution of the product of these random variables ($Z = XY$), using simple arithmetic operations, to be also Log-normal:
$$Z = e^{W_1 + W_2} \sim \mathrm{Lognormal}(\mu_X + \mu_Y,\; \sigma_X^2 + \sigma_Y^2).$$

The cumulative distribution function of $Z$ is defined as $F_Z(z) = \int_0^z f_Z(t)\, dt$, where $f_Z$ is the density function of $Z$.

Figure 2 shows the cumulative distribution functions of $X$ and $Y$ discretized with bins uniformly distributed over the x and y axes (horizontal and vertical discretizations). Figure 3 shows an example of convolution followed by condensation, using both the horizontal and the vertical condensation procedures, together with the true distribution of the product of the two Log-normal variables.
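Putting the pieces together, a usage sketch for this example, reusing product_convolution and horizontal_condensation from the sketch after Algorithm 2 (the parameters $\mu = 0$, $\sigma = 1$ are our illustrative choices):

```python
import numpy as np
from scipy.stats import lognorm

n = 100
u = (np.arange(n) + 0.5) / n                 # mid-quantile levels
x = lognorm(s=1.0).ppf(u)                    # equal-mass discretization of X
y = lognorm(s=1.0).ppf(u)                    # ... and of Y
p = q = np.full(n, 1.0 / n)                  # mass of each bin

z, F = product_convolution(x, p, y, q)       # 10,000-bin c.d.f. of Z = X*Y
edges, lo, hi = horizontal_condensation(z, F, n)

# Z = X*Y is Lognormal(0, 2); compare the condensed bounds with the truth.
exact = lognorm(s=np.sqrt(2.0)).cdf(edges)
print(float(np.max(hi - lo)), float(np.max(np.abs(exact - hi))))
```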

Figure 2: Example of different discretization methods for the representation of the c.d.f.s of two random variables ($X$ and $Y$) with Log-normal distributions. In (a) and (c), the c.d.f.s of $X$ and $Y$, respectively, with bins uniformly distributed over the x-axis (horizontal discretization); in (b) and (d), the c.d.f.s of $X$ and $Y$, respectively, with bins uniformly distributed over the y-axis (vertical discretization).
Figure 3: Example of convolution of two random variables ($X$ and $Y$) with Log-normal distributions. The result of the convolution, followed by horizontal condensation (bins uniformly distributed over the x-axis), is shown in (a), and followed by vertical condensation (bins uniformly distributed over the y-axis), in (b). The true distribution of the product is shown in (c) and (d) for horizontal and vertical discretization, respectively.

4 Test of Conditional Independence in Contingency Tables using FBST

We now apply the methods shown in the previous sections to find the evidence in favour of a complex Null hypothesis of conditional independence for discrete variables.

Given the discrete random variables $X$, $Y$, and $Z$, with $Z$ taking values in $\{1, \ldots, k\}$, the test of conditional independence of $X$ and $Y$ given $Z$ can be written as the complex Null hypothesis $H$:
$$H: X \perp Y \mid Z.$$

The hypothesis $H$ can be decomposed into its elementary components:
$$H_j: X \perp Y \mid Z = j, \qquad j = 1, \ldots, k.$$

Notice that the hypotheses are independent: for each value $j$ taken by $Z$, the values taken by the variables $X$ and $Y$ are assumed to be random observations drawn from some distribution $p_j$. Each of the elementary components is a hypothesis of independence in a contingency table. Table 1 shows the contingency table for $X$ and $Y$ taking values, respectively, in $\{1, \ldots, r\}$ and $\{1, \ldots, s\}$.

Table 1: Contingency table of $X$ and $Y$ for $Z = j$ (hypothesis $H_j$): $n_{xyj}$ is the count of observations with $X = x$ and $Y = y$ when $Z = j$.

The test of the hypothesis $H_j$ can be set up using the multinomial distribution for the cell counts of the contingency table and its natural conjugate prior, the Dirichlet distribution, for the vector of parameters $\theta = (\theta_{11}, \ldots, \theta_{rs})$.

For a given array of hyperparameters $\alpha = (\alpha_{11}, \ldots, \alpha_{rs})$, the Dirichlet distribution is defined as:
$$f(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{x,y} \alpha_{xy}\right)}{\prod_{x,y} \Gamma(\alpha_{xy})} \prod_{x,y} \theta_{xy}^{\alpha_{xy} - 1}. \tag{1}$$

The multinomial likelihood for the given contingency table, assuming the array of observations $n = (n_{11}, \ldots, n_{rs})$ and the sum of the observations $N = \sum_{x,y} n_{xy}$, is:
$$L(\theta \mid n) = \frac{N!}{\prod_{x,y} n_{xy}!} \prod_{x,y} \theta_{xy}^{n_{xy}}. \tag{2}$$

The posterior distribution will then be a Dirichlet distribution with parameters $\alpha + n$:
$$f(\theta \mid n, \alpha) \propto \prod_{x,y} \theta_{xy}^{\alpha_{xy} + n_{xy} - 1}. \tag{3}$$

Under the hypothesis $H_j$, we have $X \perp Y \mid Z = j$. In this case, the joint distribution is equal to the product of the marginals: $p(X = x, Y = y \mid Z = j) = p(X = x \mid Z = j)\, p(Y = y \mid Z = j)$. We can define this condition using the array of parameters $\theta$ as:
$$\theta_{xy} = \theta_{x \cdot}\, \theta_{\cdot y}, \tag{4}$$
where $\theta_{x \cdot} = \sum_y \theta_{xy}$ and $\theta_{\cdot y} = \sum_x \theta_{xy}$.

The point of maximum density of the posterior distribution constrained to the subset of the parameter space defined by the hypothesis $H_j$ can be estimated using the maximum a posteriori (MAP) estimator under the hypothesis (the mode of the constrained parameters $\theta$). The maximum density $f_j^*$ will be the posterior density evaluated at this point:
$$f_j^* = f_j(\theta^* \mid n), \tag{5}$$
where $\theta^* = \arg\max_{\theta \,:\, \theta_{xy} = \theta_{x\cdot}\theta_{\cdot y}} f_j(\theta \mid n)$.
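Assuming the uniform prior used later in Section 4.1 (all hyperparameters equal to 1), this constrained maximisation has a closed form; the derivation below is our addition. The posterior kernel restricted to the independence manifold factorises, so each factor is maximised by the corresponding empirical margins:
$$\max_{\theta_{xy} = p_x q_y} \prod_{x,y} (p_x q_y)^{n_{xy}}
= \Big(\max_{p} \prod_{x} p_x^{\,n_{x\cdot}}\Big)\Big(\max_{q} \prod_{y} q_y^{\,n_{\cdot y}}\Big)
\;\Longrightarrow\;
\theta^*_{xy} = \frac{n_{x\cdot}}{N}\cdot\frac{n_{\cdot y}}{N}.$$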

The evidence supporting $H_j$ can be written in terms of the truth function $W_j$, as defined in Section 3:
$$W_j(v) = \int_{\Theta_j} \mathbb{1}\left[f_j(\theta \mid n) \le v\right] f_j(\theta \mid n)\, d\theta \tag{6}$$
$$\phantom{W_j(v)} = 1 - \int_{\Theta_j} \mathbb{1}\left[f_j(\theta \mid n) > v\right] f_j(\theta \mid n)\, d\theta. \tag{7}$$

And the evidence supporting the hypothesis $H_j$ is:
$$\mathrm{ev}(H_j) = W_j(f_j^*). \tag{8}$$

Finally, the evidence supporting the hypothesis of conditional independence ($H$) will be given by the convolution of the truth functions, evaluated at the product of the maximum posterior densities of the components of the hypothesis $H$:
$$\mathrm{ev}(H) = (W_1 \otimes W_2 \otimes \cdots \otimes W_k)(f_1^* f_2^* \cdots f_k^*). \tag{9}$$

The e-value for hypothesis $H$ can be found using numerical methods of integration. An example is given in the next section, where the numerical convolution followed by the condensation procedures described in Section 3.1 is used. The application of the method of horizontal condensation results in an interval for the e-value (found using the lower and upper bounds resulting from the condensation process), and the vertical procedure results in a single value.
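As a cross-check on the convolution machinery, the e-value of a single elementary hypothesis can also be estimated by plain Monte Carlo. The sketch below is our addition, not the paper's method; it assumes a uniform Dirichlet prior (all hyperparameters 1), so the posterior over the cell probabilities is Dirichlet(counts + 1), and uses the closed-form constrained maximiser derived above:

```python
import numpy as np
from scipy.stats import dirichlet

def ev_independence(counts, n_samples=50_000, seed=0):
    """Monte Carlo FBST e-value for independence in one contingency table."""
    n = np.asarray(counts, dtype=float)
    post = dirichlet(n.ravel() + 1.0)          # posterior Dirichlet(n + 1)

    # Constrained maximiser under H_j: with a uniform prior this is the
    # outer product of the empirical row and column margins.
    p_row = n.sum(axis=1) / n.sum()
    q_col = n.sum(axis=0) / n.sum()
    f_star = post.logpdf(np.outer(p_row, q_col).ravel())

    # ev(H_j) = posterior mass of {theta : f(theta | n) <= f*}.
    rng = np.random.default_rng(seed)
    samples = post.rvs(size=n_samples, random_state=rng)
    return float(np.mean(post.logpdf(samples.T) <= f_star))
```

For instance, `ev_independence([[241, 187, 44], [139, 130, 30], [364, 302, 70]])` evaluates the elementary hypothesis $H_1$ on the first model's $Z = 1$ table (Table 2a).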

4.1 Example of CI test using FBST

In this section we describe an example of a CI test using the Full Bayesian Significance Test (FBST), with samples from two different models. For both models, we test whether the variable $X$ is conditionally independent of $Y$ given $Z$.

Figure 4: Simple probabilistic graphical models. In (a), the first model, where $X$ is conditionally independent of $Y$ given $Z$; in (b), the second model, where $X$ is not conditionally independent of $Y$ given $Z$.

The two probabilistic graphical models are shown in Figure 4; all three variables $X$, $Y$, and $Z$ assume values in $\{1, 2, 3\}$. In the first model (Figure 4a), the hypothesis of conditional independence is true, while in the second model (Figure 4b), the same hypothesis is false. The synthetic conditional probability distribution tables (CPTs) used to generate the samples are given in Appendix A.

We calculate the intervals for the e-values of the hypothesis $H$ of conditional independence and compare them for both models. The complex hypothesis can be decomposed into the elementary components $H_1$, $H_2$, and $H_3$, one for each value of $Z$.

(a) First model (Z = 1)
        Y=1   Y=2   Y=3   Total
X=1     241   187    44     472
X=2     139   130    30     299
X=3     364   302    70     736
Total   744   619   144    1507

(b) Second model (Z = 1)
        Y=1   Y=2   Y=3   Total
X=1     228   179    39     446
X=2      25    33   211     269
X=3     482    75   208     765
Total   735   287   458    1480

(c) First model (Z = 2)
        Y=1   Y=2   Y=3   Total
X=1      42    41   323     406
X=2      39    41   341     421
X=3      15    21   171     207
Total    96   103   835    1034

(d) Second model (Z = 2)
        Y=1   Y=2   Y=3   Total
X=1      77    85   248     410
X=2     165   135   120     420
X=3     188    21    24     233
Total   430   241   392    1063

(e) First model (Z = 3)
        Y=1   Y=2   Y=3   Total
X=1     282    35   151     468
X=2     131    37    79     247
X=3    1055   143   546    1744
Total  1468   215   776    2459

(f) Second model (Z = 3)
        Y=1   Y=2   Y=3   Total
X=1      40    87   354     481
X=2     119   104    27     250
X=3     305  1049   372    1726
Total   464  1240   753    2457

Table 2: Contingency tables of $X$ and $Y$ given the value of $Z$, for 5,000 random samples from each model. In (a), (c), (e), samples from the first model (Figure 4a) for $Z = 1$, $Z = 2$, and $Z = 3$, respectively; in (b), (d), (f), samples from the second model (Figure 4b) for $Z = 1$, $Z = 2$, and $Z = 3$, respectively.

For each model, 5,000 random observations were generated; the contingency tables of $X$ and $Y$ for each value of $Z$ are shown in Table 2. The hyperparameters of the prior distribution were all set to 1; the prior is then equivalent to a uniform distribution over the parameter simplex (from Equation 1).

The posterior distribution, found using Equations 2 and 3, is then a Dirichlet with parameters $n_{xyj} + 1$.

For example, for the given contingency table for the first model when $Z = 1$ (Table 2a), the posterior distribution is $\mathrm{Dirichlet}(242, 188, 45, 140, 131, 31, 365, 303, 71)$.

The point of highest density for this example, under the hypothesis of independence (Equations 4 and 5), is the product of the empirical marginal distributions of $X$ and $Y$:
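Concretely, under the uniform-prior simplification sketched after Equation 5 (our addition), the constrained maximiser for the Table 2(a) counts is the outer product of the empirical margins:

```python
import numpy as np

counts = np.array([[241, 187, 44],
                   [139, 130, 30],
                   [364, 302, 70]], dtype=float)   # Table 2(a): Z = 1
p_row = counts.sum(axis=1) / counts.sum()          # (472, 299, 736) / 1507
q_col = counts.sum(axis=0) / counts.sum()          # (744, 619, 144) / 1507
theta_star = np.outer(p_row, q_col)                # constrained MAP (Eq. 4)
```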

The truth function and the evidence supporting the hypothesis of independence given $Z = 1$ (hypothesis $H_1$) for the first model are then computed as given in Equations 6 and 8.

We used numerical integration to find the e-values of the elementary components of hypothesis $H$ ($H_1$, $H_2$, and $H_3$); the results for each model are given below.

E-values found using horizontal discretization:

and the e-values found using vertical discretization:

Figure 5 shows the histograms of the truth functions $W_1$, $W_2$, and $W_3$ for the first model ($X$ and $Y$ conditionally independent given $Z$). In Figures 5(a), 5(c), and 5(e), the bins are uniformly distributed over the horizontal axis (using the empirical range of the posterior density values). In Figures 5(b), 5(d), and 5(f), the bins are uniformly distributed over the vertical axis (each bin represents an equal increase in cumulative probability over the previous bin). Notice that the truth functions evaluated at the maximum posterior density under the respective hypotheses $H_j$, marked in red, correspond to the e-values found.

The evidence supporting the hypothesis of conditional independence $H$ is computed as in Equation 9 for each model.

The convolution is commutative, so the order of the convolutions is irrelevant. Using the algorithm for numerical convolution described in Algorithm 1, we found the convolution of the truth functions $W_1$ and $W_2$, resulting in a cumulative function ($W_1 \otimes W_2$) with $100^2$ bins (10,000 bins). We then performed the condensation procedures described in Algorithms 2 and 3, reducing the cumulative distribution to 100 bins, with lower and upper bounds for the horizontal condensation. The results are shown in Figures 6(a) and 6(b) for the first model (horizontal and vertical condensation, respectively), and in Figures 7(a) and 7(b) for the second model.

The convolution of $(W_1 \otimes W_2)$ and $W_3$ was then performed, followed by condensation. The results are shown in Figures 6(c) and 6(d) (first model), and 7(c) and 7(d) (second model).

The e-values supporting the hypothesis of conditional independence for both models are given below. The intervals for the e-values were found using horizontal discretization and condensation, and the single e-values using vertical discretization and condensation.

These results show strong evidence supporting the hypothesis of conditional independence between $X$ and $Y$ given $Z$ for the first model (using both discretization/condensation procedures), and no evidence supporting the same hypothesis for the second model. This result is promising and motivates further study of the FBST as a CI test for the structure learning of graphical models.

Figure 5: Histograms with 100 bins of the truth functions $W_1$ (a, b), $W_2$ (c, d), and $W_3$ (e, f) for the first model (Figure 4a), for each value of $Z$; panels (a), (c), (e) use horizontal discretization, and panels (b), (d), (f) use vertical discretization. In red, the maximum posterior density under the respective elementary component ($H_1$, $H_2$, and $H_3$) of the hypothesis of conditional independence $H$, for both discretization procedures.
Figure 6: Histograms with 100 bins of the resulting convolutions for the first model: (a) $W_1 \otimes W_2$ with horizontal discretization; (b) $W_1 \otimes W_2$ with vertical discretization; (c) $W_1 \otimes W_2 \otimes W_3$ with horizontal discretization; (d) $W_1 \otimes W_2 \otimes W_3$ with vertical discretization. In red in (c) and (d), the bin representing the product of the maximum posterior densities under the elementary components ($H_1$, $H_2$, and $H_3$) of the hypothesis of conditional independence for the first model.
Figure 7: Histograms with 100 bins of the resulting convolutions for the second model: (a) $W_1 \otimes W_2$ with horizontal discretization; (b) $W_1 \otimes W_2$ with vertical discretization; (c) $W_1 \otimes W_2 \otimes W_3$ with horizontal discretization; (d) $W_1 \otimes W_2 \otimes W_3$ with vertical discretization. In red in (c) and (d), the bin representing the product of the maximum posterior densities under the elementary components ($H_1$, $H_2$, and $H_3$) of the hypothesis of conditional independence for the second model.

5 Conclusion and Future Work

This paper gives a framework to perform tests of conditional independence for discrete datasets using the Full Bayesian Significance Test (FBST). A simple example of the application of this test to learn the structure of a directed acyclic graph is given, using two different models. The results suggest that the FBST should be considered a good alternative for performing CI tests in the task of learning the structure of probabilistic graphical models from data.

Future research includes the use of the FBST in an algorithm to learn structures of graphs with a larger number of variables; improving the performance of the numerical methods used to calculate the e-values (since learning DAG structures from data can require an exponential number of CI tests, each CI test needs to be performed quickly); and an empirical evaluation of the threshold on e-values used to decide between conditional independence and dependence, by minimizing a linear combination of errors of Type I and Type II (incorrect rejection of a true hypothesis of conditional independence, and failure to reject a false one).

References

  • Barlow and Pereira (1990) Barlow, R. E., & Pereira, C.A.B. (1990). Conditional independence and probabilistic influence diagrams. Lecture Notes-Monograph Series, pp. 19–33.
  • Basu and Pereira (2011) Basu, D., & Pereira, C.A.B. (2011). Conditional independence in statistics. In Selected Works of Debabrata Basu, pp. 371–384. Springer New York.
  • Borges and Stern (2007) Borges, W., & Stern, J. M. (2007). The rules of logic composition for the Bayesian epistemic e-values. Logic Journal of the IGPL, 15(5–6), pp. 401–420.
  • Cheng et al. (1997) Cheng, J., Bell, D. A., & Liu, W. (1997, January). Learning belief networks from data: An information theory based approach. In Proceedings of the Sixth International Conference on Information and Knowledge Management, pp. 325–331. ACM.
  • Kaplan and Lin (1987) Kaplan, S., & Lin, J. C. (1987). An improved condensation procedure in discrete probability distribution calculations. Risk Analysis, 7(1), pp. 15–19.
  • Pereira and Stern (1999) Pereira, C.A.B., & Stern, J.M. (1999). Evidence and credibility: Full Bayesian Significance Test for precise hypotheses. Entropy, 1, pp. 99–110.
  • Pearl and Verma (1995) Pearl, J., & Verma, T. S. (1995). A theory of inferred causation. Studies in Logic and the Foundations of Mathematics, 134, pp. 789–811.
  • Tsamardinos et al. (2006) Tsamardinos, I., Brown, L. E., & Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1), pp. 31–78.
  • Williamson (1989) Williamson, R. C. (1989). Probabilistic arithmetic (Doctoral dissertation, University of Queensland).
  • Williamson and Downs (1990) Williamson, R. C., & Downs, T. (1990). Probabilistic arithmetic I: Numerical methods for calculating convolutions and dependency bounds. International Journal of Approximate Reasoning, 4(2), pp. 89–158.
  • Yehezkel and Lerner (2009) Yehezkel, R., & Lerner, B. (2009). Bayesian network structure learning by recursive autonomy identification. Journal of Machine Learning Research, 10, pp. 1527–1570.

Appendix A

(a) CPT of Z
z    p(Z = z)
1    0.3
2    0.2
3    0.5

(b) CPT of X given Z
x    p(X = x | Z = 1)   p(X = x | Z = 2)   p(X = x | Z = 3)
1    0.3                0.4                0.2
2    0.2                0.4                0.1
3    0.5                0.2                0.7

(c) CPT of Y given Z
y    p(Y = y | Z = 1)   p(Y = y | Z = 2)   p(Y = y | Z = 3)
1    0.5                0.1                0.6
2    0.4                0.1                0.1
3    0.1                0.8                0.3

Table 3: Conditional probability distribution tables. In (a), the distribution of $Z$; in (b), the conditional distribution of $X$ given $Z$; in (c), the conditional distribution of $Y$ given $Z$ (used by the first model).
y    p(Y = y | X = 1, Z = 1)   p(Y = y | X = 1, Z = 2)   p(Y = y | X = 1, Z = 3)
1    0.5                       0.1                       0.6
2    0.4                       0.1                       0.1
3    0.1                       0.8                       0.3

y    p(Y = y | X = 2, Z = 1)   p(Y = y | X = 2, Z = 2)   p(Y = y | X = 2, Z = 3)
1    0.2                       0.4                       0.8
2    0.2                       0.3                       0.1
3    0.6                       0.3                       0.1

y    p(Y = y | X = 3, Z = 1)   p(Y = y | X = 3, Z = 2)   p(Y = y | X = 3, Z = 3)
1    0.1                       0.5                       0.2
2    0.2                       0.4                       0.6
3    0.7                       0.1                       0.2

Table 4: Conditional probability distribution table of $Y$, given $X$ and $Z$ (used by the second model).