# Inference in Graded Bayesian Networks

Machine learning provides algorithms that can learn from data and make inferences or predictions on data. Bayesian networks are a class of graphical models that represent a collection of random variables and their conditional dependencies by directed acyclic graphs. In this paper, an inference algorithm for the hidden random variables of a Bayesian network is given by using the tropicalization of the marginal distribution of the observed variables. By restricting the topological structure to graded networks, an inference algorithm for graded Bayesian networks is established that evaluates the hidden random variables rank by rank and in this way yields the most probable states of the hidden variables. This algorithm can be viewed as a generalized version of the Viterbi algorithm for graded Bayesian networks.

06/03/2022


## 1 Introduction

A Bayesian network is a statistical model that provides a graphical representation of probabilistic relationships between several random variables in the form of a directed acyclic graph. Such a network gives an efficient representation of the joint probability distribution by using the conditional dependencies between the variables. These networks were introduced by Judea Pearl in the 1980s [13] and have attracted a great deal of attention in research and industry since the 1990s. Today, Bayesian networks are widespread in artificial intelligence, knowledge engineering, and machine learning [1, 12, 14, 16].

In machine learning, statistical inference refers to the discovery of hidden states given observed states. In Bayesian networks, statistical inference amounts to finding the most probable states of the hidden variables given the observed variables. Statistical inference can be used to answer probabilistic queries about hidden variables given instantiations of the observed variables, such as diagnosis, prediction, and classification. However, this problem is NP-hard in the number of hidden variables [3].

In practical terms, approximation algorithms either impose constraints on the topological structure, as in naive Bayesian networks [9], or restrict the conditional probabilities, as in the powerful bounded variance algorithm [4]. The most popular approximation algorithms for calculating marginal distributions are based on sum-product message passing or belief propagation [1, 12, 14, 18, 19]. Other inference algorithms make use of importance sampling [5] or Markov chain Monte Carlo simulation [10].

In this paper, an inference algorithm for the hidden random variables of a Bayesian network is given by using the tropicalization of the marginal distribution of the observed variables. By restricting the topological structure to graded networks, an inference algorithm for graded Bayesian networks will be derived which evaluates the hidden random variables rank by rank and in this way yields the most probable states of the hidden variables.

## 2 Bayesian Networks

A Bayesian network is a probabilistic graphical model which represents a collection of random variables and their conditional dependencies in the form of a directed acyclic graph (DAG).

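Let $X_1,\ldots,X_n$ be random variables with state sets $\mathcal{X}_1,\ldots,\mathcal{X}_n$, respectively. Then the random vector $X=(X_1,\ldots,X_n)$ has the state set $\mathcal{X}=\mathcal{X}_1\times\cdots\times\mathcal{X}_n$. By using conditional probabilities, the joint probability distribution of the random variables factors as follows,

$$p_X = p_{X_1,\ldots,X_n} = \prod_{i=1}^{n} p_{X_i\mid X_{i+1},\ldots,X_n}. \tag{1}$$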

If a DAG models the causal relationships between the random variables, factorizations of this kind can often be simplified since some random variables may be conditionally independent of other ones.

For this, let $G=(V,E)$ be a DAG whose node set $V$ corresponds one-to-one with the collection of random variables $X_1,\ldots,X_n$. Write $\Pi(X_i)$ for the parent set of the random variable $X_i$ in the DAG, $1\leq i\leq n$. A topological sorting of a directed graph is a linear ordering of the vertices in which, for every directed edge $(u,v)$, the vertex $u$ comes before $v$. Such an ordering exists if and only if the graph is a DAG. Thus by topological sorting, there is an ordering of the nodes in which the parent set of each node consists only of nodes that precede it. For a finite node set $V$, the vertices of a DAG can be sorted topologically in $O(|V|+|E|)$ steps [7].
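The topological sorting described above can be sketched in Python along the lines of Kahn's algorithm [7]. The function name `topological_sort` and the small four-node edge list are assumptions made for illustration only; they are not taken from the paper.

```python
from collections import deque

def topological_sort(nodes, edges):
    """Kahn-style sorting: return the nodes so that parents come before children."""
    children = {v: [] for v in nodes}
    indegree = {v: 0 for v in nodes}
    for u, v in edges:            # edge (u, v) points from parent u to child v
        children[u].append(v)
        indegree[v] += 1
    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        u = queue.popleft()
        order.append(u)
        for v in children[u]:     # removing u may free some of its children
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph has a directed cycle, so it is not a DAG")
    return order

# Hypothetical four-node DAG used only as an illustration.
nodes = ["X1", "X2", "X3", "X4"]
edges = [("X1", "X2"), ("X1", "X3"), ("X2", "X4"), ("X3", "X4")]
order = topological_sort(nodes, edges)
```

Every node and every edge is handled once, which is where the linear bound in the number of nodes and edges comes from.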

A Bayesian network is a pair $(G,p)$ consisting of a DAG $G$ with $n$ nodes for some integer $n\geq 1$, which correspond one-to-one with a collection of random variables $X_1,\ldots,X_n$, and a collection $p$ of conditional probability distributions of the random variables such that the following holds:

• For each node $X_i$ which has no parent, there is a probability distribution $p_{X_i}$ of the random variable $X_i$.

• For each node $X_i$ which has a non-empty parent set $\Pi(X_i)$, there is a conditional probability distribution $p_{X_i\mid\Pi(X_i)}$.

• The joint probability function factors using the conditional probability distribution functions as follows,

$$p_{X_1,\ldots,X_n} = \prod_{i=1}^{n} p_{X_i\mid\Pi(X_i)}. \tag{2}$$

The shape of this factorization follows from the Markov property, which states that each random variable depends directly only on its parents.

• Consider the Bayesian network with the random variables $X_1,X_2,X_3,X_4$ in Fig. 1. The parent sets are $\Pi(X_1)=\emptyset$, $\Pi(X_2)=\{X_1\}$, $\Pi(X_3)=\{X_1\}$, and $\Pi(X_4)=\{X_2,X_3\}$. The joint probability function factors as follows,

$$\begin{aligned}
p_{X_1,X_2,X_3,X_4} &= p_{X_1}\,p_{X_2\mid X_1}\,p_{X_3\mid X_2,X_1}\,p_{X_4\mid X_3,X_2,X_1}\\
&= p_{X_1}\,p_{X_2\mid X_1}\,p_{X_3\mid X_1}\,p_{X_4\mid X_3,X_2}.
\end{aligned} \tag{3}$$

• A Bayesian network for printer troubleshooting adapted from the operating system Microsoft Windows 95 has 24 variables as shown in Fig. 2 [8]. Suppose all random variables have binary state sets. Then the joint probability distribution has $2^{24}$ entries. However, as a Bayesian network, only the much smaller conditional distributions of the individual variables given their parents need to be specified.
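The two examples can be illustrated with a short Python sketch. It encodes the parent sets of the four-variable network from the first example, evaluates the factorized joint distribution, and counts how many numbers a binary network needs. The conditional probability values and the helper names `make_cpt`, `joint`, and `free_parameters` are hypothetical; the paper gives no numerical tables.

```python
from itertools import product

# Parent sets read off the factorization of the four-variable example.
parents = {"X1": (), "X2": ("X1",), "X3": ("X1",), "X4": ("X2", "X3")}
order = ["X1", "X2", "X3", "X4"]

def make_cpt(n_parents, base):
    """Hypothetical binary table: P(node = 1 | parents) for each parent assignment."""
    return {pa: base + 0.1 * sum(pa) for pa in product((0, 1), repeat=n_parents)}

cpt = {"X1": make_cpt(0, 0.3), "X2": make_cpt(1, 0.2),
       "X3": make_cpt(1, 0.6), "X4": make_cpt(2, 0.1)}

def joint(assignment):
    """Joint probability as the product of one conditional factor per node."""
    prob = 1.0
    for v in order:
        pa = tuple(assignment[u] for u in parents[v])
        p_one = cpt[v][pa]
        prob *= p_one if assignment[v] == 1 else 1.0 - p_one
    return prob

# The factorized joint distribution sums to one over all 2^4 states.
total = sum(joint(dict(zip(order, s))) for s in product((0, 1), repeat=4))

def free_parameters(parent_sets):
    """Binary variables need one number per assignment of the parents."""
    return sum(2 ** len(pa) for pa in parent_sets.values())

n_factorized = free_parameters(parents)   # 1 + 2 + 2 + 4
n_full_table = 2 ** len(parents) - 1      # unconstrained joint table
```

For this small network the factorized form needs 9 numbers instead of 15; the gap grows quickly with the number of variables as long as the parent sets stay small.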

## 3 Inference Algorithm

In this section, we present an inference algorithm for the hidden random variables of a Bayesian network by using the tropicalization of the marginal distribution of the observed variables. By restricting the topological structure to graded networks, an inference algorithm for graded Bayesian networks will be obtained that evaluates the hidden random variables rank by rank and in this way yields the most probable states of the hidden variables. This algorithm can be viewed as a generalized version of the Viterbi algorithm for graded Bayesian networks.

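For this, let $(G,p)$ be a Bayesian network given by the DAG $G=(V,E)$ and the global probability distribution $p$ as defined in (2). The node set $V$ with $|V|=m+n$ is assumed to correspond one-to-one with a collection of random variables denoted by $X_1,\ldots,X_m,Y_1,\ldots,Y_n$. Suppose the variables $X_1,\ldots,X_m$ are observed or instantiated and the variables $Y_1,\ldots,Y_n$ are unobserved or hidden. We may assume that the observed variables $X_1,\ldots,X_m$ and the hidden variables $Y_1,\ldots,Y_n$ are each listed in topological order. Note that according to this sorting, the variable $X_1$ has only parents in the hidden variables and the variable $Y_1$ has only parents in the observed variables. Let the variables $X_i$ and $Y_j$ have finite state sets $\mathcal{X}_i$ and $\mathcal{Y}_j$, respectively. Then the random vectors $X=(X_1,\ldots,X_m)$ and $Y=(Y_1,\ldots,Y_n)$ have state sets $\mathcal{X}=\mathcal{X}_1\times\cdots\times\mathcal{X}_m$ and $\mathcal{Y}=\mathcal{Y}_1\times\cdots\times\mathcal{Y}_n$, respectively. Thus the global probability distribution factors as follows,

$$p_{X,Y} = \prod_{i=1}^{m} p_{X_i\mid\Pi(X_i)} \prod_{j=1}^{n} p_{Y_j\mid\Pi(Y_j)}. \tag{4}$$

The probability of the observed sequence $x\in\mathcal{X}$ is given by the marginal distribution

$$p_X(x) = \sum_{y\in\mathcal{Y}} p_{X,Y}(x,y). \tag{5}$$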

The variables in the DAG can be equipped with a semi-ranking function $\rho$ from $V$ to $\mathbb{N}_0$. For this, each variable with empty parent set or parent set in the observed variables is given the semi-rank $0$. Since the graph is a DAG, there is at least one variable with semi-rank 0. Moreover, each hidden variable $Z$ whose parents have already been assigned semi-ranks is given the semi-rank

$$\rho(Z) = \max\{\rho(U)\mid U \text{ hidden and parent of } Z\}+1. \tag{6}$$

Furthermore, each observed variable $Z$ is given the largest semi-rank of its hidden parents,

$$\rho(Z) = \max\{\rho(U)\mid U \text{ hidden and parent of } Z\}. \tag{7}$$

The reason is that the conditional probability of an observed variable with a given value can be evaluated as soon as its parents are instantiated (Ex. 3).
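The rules (6) and (7) translate directly into a short procedure. The following Python sketch is one possible reading of them; the function name `semi_ranks` and the dictionary representation of parent sets are assumptions. The parent sets are read off the factorization (10) of the network with one observed variable $X_1$ and hidden variables $Y_1,\ldots,Y_5$.

```python
def semi_ranks(parents, hidden, order):
    """Semi-rank of every node; `order` must list parents before children."""
    rho = {}
    for z in order:
        hidden_parent_ranks = [rho[u] for u in parents[z] if u in hidden]
        if not hidden_parent_ranks:
            rho[z] = 0                             # no hidden parents
        elif z in hidden:
            rho[z] = max(hidden_parent_ranks) + 1  # rule (6)
        else:
            rho[z] = max(hidden_parent_ranks)      # rule (7)
    return rho

# Parent sets taken from the factorization (10).
parents = {"X1": (), "Y1": ("X1",), "Y2": ("Y1",), "Y3": ("Y1",),
           "Y4": ("Y2",), "Y5": ("Y3", "Y4")}
hidden = {"Y1", "Y2", "Y3", "Y4", "Y5"}
order = ["X1", "Y1", "Y2", "Y3", "Y4", "Y5"]
rho = semi_ranks(parents, hidden, order)
```

Here $Y_5$ receives semi-rank 3 because its parent $Y_4$ has semi-rank 2, even though its other parent $Y_3$ only has semi-rank 1.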

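Let $\rho_{\max}$ denote the maximal semi-rank of the nodes in the DAG and let $X^{(r)}_1,\ldots,X^{(r)}_{s_r}$ and $Y^{(r)}_1,\ldots,Y^{(r)}_{t_r}$ denote the collections of observed and hidden random variables with semi-rank $r$, $0\leq r\leq\rho_{\max}$, respectively. Then we have $s_0+\cdots+s_{\rho_{\max}}=m$ and $t_0+\cdots+t_{\rho_{\max}}=n$. Moreover, the state set of the hidden variables with semi-rank $r$ is denoted by

$$D^{(r)} = \mathcal{Y}^{(r)}_1\times\cdots\times\mathcal{Y}^{(r)}_{t_r},\qquad 0\leq r\leq\rho_{\max}. \tag{8}$$

Then by the semi-ranks of the nodes, the marginal distribution (5) can be written according to the following sum-product decomposition,

$$\begin{aligned}
p_X(x) = {} & \Biggl(\sum_{y\in D^{(0)}}\prod_{i=1}^{s_0}p_{X^{(0)}_i\mid\Pi(X^{(0)}_i)}\prod_{j=1}^{t_0}p_{Y^{(0)}_j\mid\Pi(Y^{(0)}_j)}\\
& \cdot\Biggl(\sum_{y\in D^{(1)}}\prod_{i=1}^{s_1}p_{X^{(1)}_i\mid\Pi(X^{(1)}_i)}\prod_{j=1}^{t_1}p_{Y^{(1)}_j\mid\Pi(Y^{(1)}_j)}\\
& \;\;\vdots\\
& \cdot\Biggl(\sum_{y\in D^{(\rho)}}\prod_{i=1}^{s_\rho}p_{X^{(\rho)}_i\mid\Pi(X^{(\rho)}_i)}\prod_{j=1}^{t_\rho}p_{Y^{(\rho)}_j\mid\Pi(Y^{(\rho)}_j)}\Biggr)\cdots\Biggr)\Biggr),
\end{aligned} \tag{9}$$

where $\rho=\rho_{\max}$ and the arguments of the conditional probabilities have been omitted for readability. This decomposition is sound, since the computation in the $r$-th bracket, corresponding to the collections of variables $X^{(r)}_1,\ldots,X^{(r)}_{s_r}$ and $Y^{(r)}_1,\ldots,Y^{(r)}_{t_r}$ of semi-rank $r$, depends, apart from the variables of semi-rank $r$ themselves, only on parent nodes of lower semi-rank.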

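• Consider the Bayesian network given by the DAG in Fig. 3. Take the topological sorting $X_1,Y_1,Y_2,Y_3,Y_4,Y_5$. The random variables have semi-ranks $\rho(X_1)=\rho(Y_1)=0$, $\rho(Y_2)=\rho(Y_3)=1$, $\rho(Y_4)=2$, and $\rho(Y_5)=3$. In view of the DAG, the joint probability distribution factors as follows,

$$p_{X,Y} = p_{X_1}\,p_{Y_1\mid X_1}\,p_{Y_2\mid Y_1}\,p_{Y_3\mid Y_1}\,p_{Y_4\mid Y_2}\,p_{Y_5\mid Y_3,Y_4}. \tag{10}$$

The marginal distribution of the observed value $x_1$ can be written as follows,

$$\begin{aligned}
p_{X_1}(x_1) &= \sum_{(y_1,\ldots,y_5)\in\mathcal{Y}} p_{X,Y}(x_1,y_1,\ldots,y_5)\\
&= p_{X_1}(x_1)\cdot\Biggl(\sum_{y_1\in\mathcal{Y}_1}p_{Y_1\mid X_1}(y_1\mid x_1)\\
&\quad\cdot\Biggl(\sum_{(y_2,y_3)\in\mathcal{Y}_2\times\mathcal{Y}_3}p_{Y_2\mid Y_1}(y_2\mid y_1)\,p_{Y_3\mid Y_1}(y_3\mid y_1)\\
&\quad\cdot\Biggl(\sum_{y_4\in\mathcal{Y}_4}p_{Y_4\mid Y_2}(y_4\mid y_2)\\
&\quad\cdot\Biggl(\sum_{y_5\in\mathcal{Y}_5}p_{Y_5\mid Y_3,Y_4}(y_5\mid y_3,y_4)\Biggr)\Biggr)\Biggr)\Biggr).
\end{aligned} \tag{11}$$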

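• Consider the Bayesian network given by the DAG in Fig. 4. We have $m=3$ and $n=4$, and the random variables have the semi-ranks $\rho(X_1)=\rho(X_2)=\rho(Y_1)=0$, $\rho(Y_2)=\rho(Y_3)=\rho(X_3)=1$, and $\rho(Y_4)=2$. In view of the DAG, the joint probability distribution factors as follows,

$$p_{X,Y} = p_{X_1}\,p_{X_2\mid X_1}\,p_{Y_1\mid X_1}\,p_{Y_2\mid X_2,Y_1}\,p_{Y_3\mid Y_1}\,p_{X_3\mid Y_2}\,p_{Y_4\mid Y_2,Y_3}. \tag{12}$$

The marginal distribution of the observed sequence $(x_1,x_2,x_3)$ can be decomposed as follows,

$$\begin{aligned}
p_X(x_1,x_2,x_3) &= \sum_{(y_1,y_2,y_3,y_4)\in\mathcal{Y}} p_{X,Y}(x_1,x_2,x_3,y_1,y_2,y_3,y_4)\\
&= p_{X_1}(x_1)\,p_{X_2\mid X_1}(x_2\mid x_1)\cdot\Biggl(\sum_{y_1\in\mathcal{Y}_1}p_{Y_1\mid X_1}(y_1\mid x_1)\\
&\quad\cdot\Biggl(\sum_{(y_2,y_3)\in\mathcal{Y}_2\times\mathcal{Y}_3}p_{Y_2\mid X_2,Y_1}(y_2\mid x_2,y_1)\,p_{Y_3\mid Y_1}(y_3\mid y_1)\,p_{X_3\mid Y_2}(x_3\mid y_2)\\
&\quad\cdot\Biggl(\sum_{y_4\in\mathcal{Y}_4}p_{Y_4\mid Y_2,Y_3}(y_4\mid y_2,y_3)\Biggr)\Biggr)\Biggr).
\end{aligned} \tag{13}$$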
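As a sanity check, the nested decomposition of the last example can be compared numerically with the plain sum over all hidden states. The sketch below assumes binary state sets and randomly generated conditional probability tables; both are illustrative choices, as are the names `random_cpt`, `marginal_flat`, and `marginal_nested`.

```python
from itertools import product
import random

random.seed(0)
S = (0, 1)  # assumed binary state set for every variable

def random_cpt(n_parents):
    """Random table mapping each parent assignment to a distribution over S."""
    table = {}
    for pa in product(S, repeat=n_parents):
        p = random.uniform(0.1, 0.9)
        table[pa] = (p, 1.0 - p)
    return table

# One table per factor of the factorization (12).
pX1, pX2, pY1 = random_cpt(0), random_cpt(1), random_cpt(1)
pY2, pY3, pX3, pY4 = random_cpt(2), random_cpt(1), random_cpt(1), random_cpt(2)

def joint(x1, x2, x3, y1, y2, y3, y4):
    return (pX1[()][x1] * pX2[(x1,)][x2] * pY1[(x1,)][y1]
            * pY2[(x2, y1)][y2] * pY3[(y1,)][y3]
            * pX3[(y2,)][x3] * pY4[(y2, y3)][y4])

def marginal_flat(x1, x2, x3):
    """Direct sum over all hidden state sequences."""
    return sum(joint(x1, x2, x3, *y) for y in product(S, repeat=4))

def marginal_nested(x1, x2, x3):
    """Sum organized by semi-rank: y1, then (y2, y3), then y4."""
    total = 0.0
    for y1 in S:
        inner1 = 0.0
        for y2, y3 in product(S, S):
            inner2 = sum(pY4[(y2, y3)][y4] for y4 in S)
            inner1 += (pY2[(x2, y1)][y2] * pY3[(y1,)][y3]
                       * pX3[(y2,)][x3] * inner2)
        total += pY1[(x1,)][y1] * inner1
    return pX1[()][x1] * pX2[(x1,)][x2] * total
```

Both functions agree up to rounding for every observed triple, which is what the distributive law behind the decomposition predicts.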

The marginal distribution of the observed random variables can be used for the probabilistic inference of the hidden random variables, which amounts to finding the most probable state sequences of the hidden variables. This can be achieved by tropicalization of the marginal distribution of the observed variables. For this, we introduce the tropical semiring [11].

A semiring is an algebraic structure similar to a ring, but without the requirement that each element must have an additive inverse. More specifically, a semiring is a non-empty set $R$ together with two binary operations, called addition $+$ and multiplication $\cdot$, such that $(R,+)$ is a commutative monoid with identity element 0, $(R,\cdot)$ is a monoid with identity element 1, the multiplication distributes over addition, i.e., for all $a,b,c\in R$,

$$a\cdot(b+c) = (a\cdot b)+(a\cdot c)\quad\text{and}\quad(a+b)\cdot c = (a\cdot c)+(b\cdot c), \tag{14}$$

and the multiplication with 0 annihilates $R$, i.e., for all $a\in R$, $a\cdot 0 = 0 = 0\cdot a$.

A semiring is commutative if its multiplication is commutative, i.e., for all $a,b\in R$, $a\cdot b = b\cdot a$. A semiring is idempotent if its addition is idempotent, i.e., for all $a\in R$, $a+a=a$.

For instance, each ring is also a semiring. Moreover, the set of natural numbers $\mathbb{N}_0$ forms a commutative semiring with the ordinary addition and multiplication. Likewise, the set of non-negative real numbers $\mathbb{R}_{\geq 0}$ forms a commutative semiring.

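The set $\mathbb{R}\cup\{\infty\}$ together with the operations

$$x\oplus y = \min\{x,y\}\quad\text{and}\quad x\odot y = x+y,\qquad x,y\in\mathbb{R}\cup\{\infty\}, \tag{15}$$

with $x+\infty=\infty$ for all $x\in\mathbb{R}\cup\{\infty\}$ forms an idempotent commutative semiring with additive identity $\infty$ and multiplicative identity 0. Note that additive and multiplicative inverses may not exist in a semiring. For instance, an equation of the form $a\oplus x=b$ with $a<b$ and the equation $\infty\odot x=0$ have no solutions $x\in\mathbb{R}\cup\{\infty\}$. This semiring is known as the tropical semiring. The attribute "tropical" was coined by French scholars (1998) in honor of the Brazilian mathematician Imre Simon who studied the tropical semiring in the early 1960s.

The mapping $\phi:\mathbb{R}_{\geq 0}\to\mathbb{R}\cup\{\infty\}$, $x\mapsto-\log x$, is bijective and monotonically decreasing with $\phi(0)=\infty$, $\phi(1)=0$, and

$$\phi(x\cdot y) = \phi(x)\odot\phi(y),\qquad x,y\in\mathbb{R}_{\geq 0}. \tag{16}$$

The mapping $\phi$ is the tropicalization of the ordinary semiring $(\mathbb{R}_{\geq 0},+,\cdot)$. In this way, large values (probabilities) are mapped to small values (weights) and vice versa.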
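The tropical operations and the mapping from probabilities to weights are simple enough to write down directly. The following Python sketch uses the hypothetical names `t_add`, `t_mul`, and `phi`, and takes the natural logarithm; any other base would only rescale the weights.

```python
import math

INF = math.inf  # additive identity of the tropical semiring

def t_add(x, y):
    """Tropical addition: the minimum of the two weights."""
    return min(x, y)

def t_mul(x, y):
    """Tropical multiplication: the ordinary sum of the two weights."""
    return x + y

def phi(p):
    """Map a non-negative number to a weight, with phi(0) = infinity."""
    return INF if p == 0 else -math.log(p)

# Products of probabilities become tropical products of weights, as in (16).
lhs = phi(0.2 * 0.5)
rhs = t_mul(phi(0.2), phi(0.5))
```

Note that only the product is carried over exactly by the mapping; ordinary sums are replaced by minima when an expression is tropicalized, which turns a sum over hidden states into a search for the best hidden state.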

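Given an observed sequence $x\in\mathcal{X}$, the objective is to find one (or all) sequences $y\in\mathcal{Y}$ with maximum likelihood

$$p_{Y\mid X}(y\mid x) = \frac{p_{X,Y}(x,y)}{p_X(x)}. \tag{17}$$

Since the observed sequence $x$ is fixed, the likelihood is directly proportional to the joint probability $p_{X,Y}(x,y)$ provided that $p_X(x)>0$. Suppose that $p_X(x)>0$. Then the aim is to find one (or all) sequences $\bar{y}\in\mathcal{Y}$ with the property

$$\bar{y} = \operatorname*{argmax}_{y\in\mathcal{Y}}\{p_{X,Y}(x,y)\}. \tag{18}$$

Each optimal sequence $\bar{y}$ is called an explanation of the given sequence $x$. The explanations can be found by tropicalization. For this, put $w_{X,Y}(x,y)=-\log p_{X,Y}(x,y)$ for all $x\in\mathcal{X}$ and $y\in\mathcal{Y}$. Then the tropicalization of the marginal distribution (5) yields

$$w_X(x) = \bigoplus_{y\in\mathcal{Y}} w_{X,Y}(x,y). \tag{19}$$

The explanations can be obtained by evaluation in the tropical semiring,

$$\bar{y} = \operatorname*{argmin}_{y\in\mathcal{Y}}\{w_{X,Y}(x,y)\}. \tag{20}$$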

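The value $w_X(x)$ can be computed by tropicalizing the sum-product decomposition of the marginal probability $p_X(x)$. For this, we put $w_{Z\mid\Pi(Z)}=-\log p_{Z\mid\Pi(Z)}$ for each random variable $Z$. Thus if in the above sum-product decomposition of $p_X(x)$ sums are replaced by tropical sums and products by tropical products, we obtain

$$\begin{aligned}
w_X(x) = {} & \Biggl(\bigoplus_{y\in D^{(0)}}\bigodot_{i=1}^{s_0}w_{X^{(0)}_i\mid\Pi(X^{(0)}_i)}\odot\bigodot_{j=1}^{t_0}w_{Y^{(0)}_j\mid\Pi(Y^{(0)}_j)}\\
& \odot\Biggl(\bigoplus_{y\in D^{(1)}}\bigodot_{i=1}^{s_1}w_{X^{(1)}_i\mid\Pi(X^{(1)}_i)}\odot\bigodot_{j=1}^{t_1}w_{Y^{(1)}_j\mid\Pi(Y^{(1)}_j)}\\
& \;\;\vdots\\
& \odot\Biggl(\bigoplus_{y\in D^{(\rho)}}\bigodot_{i=1}^{s_\rho}w_{X^{(\rho)}_i\mid\Pi(X^{(\rho)}_i)}\odot\bigodot_{j=1}^{t_\rho}w_{Y^{(\rho)}_j\mid\Pi(Y^{(\rho)}_j)}\Biggr)\cdots\Biggr)\Biggr),
\end{aligned} \tag{21}$$

where $\rho=\rho_{\max}$ and the arguments of the weights have been omitted for readability. This yields the following result.

###### Proposition 3.1.

Let $x\in\mathcal{X}$. The tropicalization $w_X(x)$ of the marginal probability $p_X(x)$ provides the explanations of the sequence $x$.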
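The statement can be checked by brute force on a toy instance. In the following Python sketch the joint probabilities for a fixed observation are made-up numbers, and the hidden sequences are written as two-letter strings; it merely illustrates that minimizing weights and maximizing probabilities select the same sequences.

```python
import math

# Hypothetical joint probabilities p_{X,Y}(x, y) for one fixed observation x.
p_joint = {"aa": 0.02, "ab": 0.10, "ba": 0.05, "bb": 0.10}

# Weights are negative logarithms of the joint probabilities.
weights = {y: -math.log(p) for y, p in p_joint.items()}

# Tropical sum over all hidden sequences, as in (19).
w_x = min(weights.values())

# Explanations: all hidden sequences attaining the minimal weight, as in (20).
explanations = sorted(y for y, w in weights.items() if w == w_x)

# Most probable sequences found directly from the probabilities, as in (18).
p_max = max(p_joint.values())
most_probable = sorted(y for y, p in p_joint.items() if p == p_max)
```

In this instance there are two explanations, which shows why the text speaks of one or all optimal sequences.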

However, the tropicalization of the marginal probability does not overcome the NP-hardness of the inference problem. Our aim is to provide an easily structured inference algorithm for a class of topologically constrained Bayesian networks which emerge quite naturally in practice.

For this, a DAG is called graded if it can be equipped with a rank function $\rho$ from $V$ to $\mathbb{N}_0$. A rank function of a DAG must be compatible with the given topological ordering and the rank must be consistent with the covering relation of the ordering [17]. In our case, each variable with empty parent set or parent set in the observed variables is given the rank $0$. Moreover, each hidden variable is assigned the rank $r+1$ if all its hidden parents have rank $r$. Furthermore, each observed variable is assigned the rank $r$ if all its hidden parents have rank $r$. For instance, the DAG in Fig. 4 is graded, while the DAG in Fig. 3 is not. A Bayesian network is graded if its underlying DAG is graded.
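A gradedness test follows directly from these rules: a rank function exists exactly when the hidden parents of every node share one rank. The Python sketch below is one way to implement this reading; the name `rank_function` is hypothetical, and the two parent-set dictionaries are read off the factorizations (10) and (12) of the networks in Fig. 3 and Fig. 4.

```python
def rank_function(parents, hidden, order):
    """Return the rank of every node, or None if the DAG is not graded.

    `order` must list parents before children.
    """
    rho = {}
    for z in order:
        ranks = {rho[u] for u in parents[z] if u in hidden}
        if not ranks:
            rho[z] = 0                      # no hidden parents
        elif len(ranks) > 1:
            return None                     # hidden parents of different rank
        else:
            r = ranks.pop()
            rho[z] = r + 1 if z in hidden else r
    return rho

# Network of Fig. 3, parent sets from (10).
fig3 = {"X1": (), "Y1": ("X1",), "Y2": ("Y1",), "Y3": ("Y1",),
        "Y4": ("Y2",), "Y5": ("Y3", "Y4")}
order3 = ["X1", "Y1", "Y2", "Y3", "Y4", "Y5"]
hidden3 = {"Y1", "Y2", "Y3", "Y4", "Y5"}

# Network of Fig. 4, parent sets from (12).
fig4 = {"X1": (), "X2": ("X1",), "Y1": ("X1",), "Y2": ("X2", "Y1"),
        "Y3": ("Y1",), "X3": ("Y2",), "Y4": ("Y2", "Y3")}
order4 = ["X1", "X2", "Y1", "Y2", "Y3", "X3", "Y4"]
hidden4 = {"Y1", "Y2", "Y3", "Y4"}

ranks3 = rank_function(fig3, hidden3, order3)
ranks4 = rank_function(fig4, hidden4, order4)
```

The first network fails the test at $Y_5$, whose hidden parents $Y_3$ and $Y_4$ have different ranks, while the second network passes.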

Inference in a graded Bayesian network has the advantage that in the computation of the $r$-th expression

$$\bigoplus_{y\in D^{(r)}}\bigodot_{i=1}^{s_r}w_{X^{(r)}_i\mid\Pi(X^{(r)}_i)}\odot\bigodot_{j=1}^{t_r}w_{Y^{(r)}_j\mid\Pi(Y^{(r)}_j)},\qquad 1\leq r\leq\rho_{\max}, \tag{22}$$

the terms $w_{X^{(r)}_i\mid\Pi(X^{(r)}_i)}$ and $w_{Y^{(r)}_j\mid\Pi(Y^{(r)}_j)}$ depend only on the parent values of the hidden variables of the previous rank $r-1$. In this way, the evaluation of the expression has a simple bookkeeping structure (Alg. 1). By the gradedness of the nodes, the hidden parents of each hidden variable with rank $r$ all have rank $r-1$ and so the computation of the array element $A[r,y]$ with rank $r$ and $y\in D^{(r)}$ requires only the array elements $A[r-1,y']$ with $y'\in D^{(r-1)}$ of the previous rank.

The algorithm follows the principle of dynamic programming [2] and consists of a forward algorithm evaluating the tropicalized expression and a backward algorithm which provides one or all explanations of the collection of hidden variables. The latter is achieved by recording in each step all state values in $D^{(r-1)}$ which attain the minimum in the minimization step, $1\leq r\leq\rho_{\max}$. This information can already be recorded by the forward algorithm. Then the trace back of all optimal decisions made in each step can provide all explanations. The forward algorithm evaluates the tropicalized expression for $w_X(x)$ by using an array $A$ such that the array entries $A[r,y]$ with $y\in D^{(r)}$ record all decisions made up to the variables of rank $r$.

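The complexity of the evaluation of the tropicalized term $w_X(x)$ depends on the underlying DAG. The array $A$ has size $\sum_{r=0}^{\rho_{\max}}|D^{(r)}|$ and the computation of the array element $A[r,y]$, $r\geq 1$, requires $O(|D^{(r-1)}|)$ steps. Suppose all state sets have $l$ elements. Then we have $|D^{(r)}|=l^{t_r}$ for all $0\leq r\leq\rho_{\max}$.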

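In the best case, the hidden random variables all have the same rank and common observed ancestors. Then $\rho_{\max}=0$. In view of the graded DAG in Fig. 5, the random variables have ranks $\rho(X_1)=\rho(Y_1)=\cdots=\rho(Y_n)=0$. Since the minimization is decoupled, the inference algorithm has time complexity $O(nl)$, where $l$ is the common number of states of the hidden variables, and computes for each observed value $x_1$ the following,

$$\begin{aligned}
w_{X_1}(x_1) &= \min_{y_1,\ldots,y_n}\bigl(w_{Y_1\mid X_1}(y_1\mid x_1)+\cdots+w_{Y_n\mid X_1}(y_n\mid x_1)\bigr)\\
&= \min_{y_1}\bigl(w_{Y_1\mid X_1}(y_1\mid x_1)\bigr)+\cdots+\min_{y_n}\bigl(w_{Y_n\mid X_1}(y_n\mid x_1)\bigr).
\end{aligned} \tag{23}$$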
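The decoupling can be made concrete with a few lines of Python. The weights below are invented for three hidden variables with states `a` and `b` and a fixed observed value; the sketch compares the joint minimization over all state combinations with the sum of the separate minima.

```python
from itertools import product

# Hypothetical weights w_{Y_j|X_1}(y_j | x_1) for a fixed x_1, one table per Y_j.
w = [{"a": 1.2, "b": 0.4}, {"a": 0.3, "b": 2.0}, {"a": 0.9, "b": 0.7}]

# Joint minimization over all 2^3 combinations of hidden states.
coupled = min(sum(w[j][y] for j, y in enumerate(ys))
              for ys in product("ab", repeat=len(w)))

# Decoupled form: minimize each term separately and add the minima.
decoupled = sum(min(table.values()) for table in w)

# The explanation picks the minimizing state of each hidden variable.
explanation = tuple(min(table, key=table.get) for table in w)
```

The decoupled form inspects each table once, whereas the joint form enumerates every combination of states.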

In the hidden Markov model (HMM) the hidden random variables form a chain. Here $m=n$ and $s_r=t_r=1$ for each rank $r$. In view of the graded DAG in Fig. 6, the random variables have ranks $\rho(X_i)=\rho(Y_i)=i-1$ for $1\leq i\leq n$. The Viterbi algorithm [15, 11, 20] calculates for each observed sequence $x=(x_1,\ldots,x_n)$ the following,

$$\begin{aligned}
A[0,y] &:= w_{X_1\mid Y_1}(x_1\mid y)+w_{Y_1}(y),\\
A[1,y] &:= \min_{y_1}\bigl(w_{Y_2\mid Y_1}(y\mid y_1)+w_{X_2\mid Y_2}(x_2\mid y)+A[0,y_1]\bigr),\\
&\;\;\vdots\\
A[n-1,y] &:= \min_{y_{n-1}}\bigl(w_{Y_n\mid Y_{n-1}}(y\mid y_{n-1})+w_{X_n\mid Y_n}(x_n\mid y)+A[n-2,y_{n-1}]\bigr),\\
w_X(x) &:= \min_{y_n}A[n-1,y_n].
\end{aligned} \tag{24}$$

If each hidden variable has $l$ states, the array $A$ has size $nl$ and the computation of each array element requires $O(l)$ steps. Hence, the time complexity is $O(nl^2)$. Note that the Bayesian networks for the hidden tree Markov model [6] and stochastic automata [21] are graded as well and their inference algorithms both have the same time complexity $O(nl^2)$.
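The recursion for the hidden Markov model can be sketched in Python as a min-plus Viterbi pass with back pointers. The table layouts `w_init[y]`, `w_trans[y_prev][y]`, and `w_emit[y][x]` and all numerical values are assumptions made for illustration; the first emission is taken to depend on the first hidden state, as in a standard HMM.

```python
import math

def viterbi(obs, states, w_init, w_trans, w_emit):
    """Min-plus Viterbi on weights (negative log-probabilities).

    Returns the minimal total weight and one optimal hidden sequence.
    """
    # Rank 0: initial weight plus the weight of the first emission.
    A = [{y: w_init[y] + w_emit[y][obs[0]] for y in states}]
    back = []
    for x in obs[1:]:
        row, ptr = {}, {}
        for y in states:
            # Best predecessor state for reaching y at this rank.
            best = min(states, key=lambda yp: A[-1][yp] + w_trans[yp][y])
            row[y] = A[-1][best] + w_trans[best][y] + w_emit[y][x]
            ptr[y] = best
        A.append(row)
        back.append(ptr)
    # Trace back one explanation from the best final state.
    last = min(states, key=lambda y: A[-1][y])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return A[-1][last], path[::-1]

def w(p):
    return -math.log(p)

# Hypothetical two-state HMM with binary observations.
states = ("a", "b")
w_init = {"a": w(0.6), "b": w(0.4)}
w_trans = {"a": {"a": w(0.7), "b": w(0.3)}, "b": {"a": w(0.4), "b": w(0.6)}}
w_emit = {"a": {0: w(0.9), 1: w(0.1)}, "b": {0: w(0.2), 1: w(0.8)}}

weight, path = viterbi([0, 1, 1], states, w_init, w_trans, w_emit)
```

Each array row has one entry per state and each entry is a minimum over the states of the previous rank, which matches the quadratic dependence on the number of states.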

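In the worst case, the hidden random variables have the same rank and common observed descendants. Then $\rho_{\max}=0$. In view of the graded DAG in Fig. 7, the random variables have ranks $\rho(Y_1)=\cdots=\rho(Y_n)=\rho(X_1)=0$. Since the minimization is fully coupled, the inference algorithm has time complexity $O(l^n)$, where $l$ is the common number of states of the hidden variables, and computes for each observed value $x_1$ the following,

$$w_{X_1}(x_1) = \min_{y_1,\ldots,y_n}\Bigl(w_{X_1\mid Y_1,\ldots,Y_n}(x_1\mid y_1,\ldots,y_n)+\sum_{i=1}^{n}w_{Y_i}(y_i)\Bigr). \tag{25}$$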
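
• In view of the Bayesian network in Ex. 3, the tropicalization of the marginal distribution gives

$$\begin{aligned}
w_X(x_1,x_2,x_3) &= \bigoplus_{(y_1,y_2,y_3,y_4)\in\mathcal{Y}} w_{X,Y}(x_1,x_2,x_3,y_1,y_2,y_3,y_4)\\
&= w_{X_1}(x_1)\odot w_{X_2\mid X_1}(x_2\mid x_1)\odot\Biggl(\bigoplus_{y_1\in\mathcal{Y}_1}w_{Y_1\mid X_1}(y_1\mid x_1)\\
&\quad\odot\Biggl(\bigoplus_{(y_2,y_3)\in\mathcal{Y}_2\times\mathcal{Y}_3}w_{Y_2\mid X_2,Y_1}(y_2\mid x_2,y_1)\odot w_{Y_3\mid Y_1}(y_3\mid y_1)\odot w_{X_3\mid Y_2}(x_3\mid y_2)\\
&\quad\odot\Biggl(\bigoplus_{y_4\in\mathcal{Y}_4}w_{Y_4\mid Y_2,Y_3}(y_4\mid y_2,y_3)\Biggr)\Biggr)\Biggr).
\end{aligned} \tag{26}$$

Assume that the hidden variables have the common state set $\{a,b\}$. Then we have $D^{(0)}=\{a,b\}$, $D^{(1)}=\{aa,ab,ba,bb\}$, and $D^{(2)}=\{a,b\}$. The forward inference algorithm computes the following:

$$\begin{aligned}
A[0,a] &= w_{Y_1\mid X_1}(a\mid x_1)+w_{X_1}(x_1)+w_{X_2\mid X_1}(x_2\mid x_1),\\
A[0,b] &= w_{Y_1\mid X_1}(b\mid x_1)+w_{X_1}(x_1)+w_{X_2\mid X_1}(x_2\mid x_1),\\
A[1,aa] &= \min_{y_1\in D^{(0)}}\bigl(A[0,y_1]+w_{Y_2\mid X_2,Y_1}(a\mid x_2,y_1)+w_{Y_3\mid Y_1}(a\mid y_1)+w_{X_3\mid Y_2}(x_3\mid a)\bigr),\\
A[1,ab] &= \min_{y_1\in D^{(0)}}\bigl(A[0,y_1]+w_{Y_2\mid X_2,Y_1}(a\mid x_2,y_1)+w_{Y_3\mid Y_1}(b\mid y_1)+w_{X_3\mid Y_2}(x_3\mid a)\bigr),\\
A[1,ba] &= \min_{y_1\in D^{(0)}}\bigl(A[0,y_1]+w_{Y_2\mid X_2,Y_1}(b\mid x_2,y_1)+w_{Y_3\mid Y_1}(a\mid y_1)+w_{X_3\mid Y_2}(x_3\mid b)\bigr),\\
A[1,bb] &= \min_{y_1\in D^{(0)}}\bigl(A[0,y_1]+w_{Y_2\mid X_2,Y_1}(b\mid x_2,y_1)+w_{Y_3\mid Y_1}(b\mid y_1)+w_{X_3\mid Y_2}(x_3\mid b)\bigr),\\
A[2,a] &= \min_{y_2y_3\in D^{(1)}}\bigl(A[1,y_2y_3]+w_{Y_4\mid Y_2,Y_3}(a\mid y_2,y_3)\bigr),\\
A[2,b] &= \min_{y_2y_3\in D^{(1)}}\bigl(A[1,y_2y_3]+w_{Y_4\mid Y_2,Y_3}(b\mid y_2,y_3)\bigr).
\end{aligned} \tag{27}$$

Then we have $w_X(x_1,x_2,x_3)=\min\{A[2,a],A[2,b]\}$.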
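The rank-by-rank computation of this example can be sketched in Python and compared with a brute-force minimization over all 16 hidden sequences. The weight tables are generated at random and the observed variables are assumed to be binary; both are illustrative assumptions, as are the names `random_weights`, `forward`, and `brute_force`. The factor for $Y_4$ is conditioned on $Y_2$ and $Y_3$, as in the factorization (12).

```python
from itertools import product
import math
import random

random.seed(1)
S = ("a", "b")   # common state set of the hidden variables
OBS = (0, 1)     # assumed binary state set of the observed variables

def random_weights(parent_sets, child_set):
    """Hypothetical table w[parents][child] = -log p(child | parents)."""
    table = {}
    for pa in product(*parent_sets):
        raw = [random.uniform(0.1, 1.0) for _ in child_set]
        table[pa] = {c: -math.log(r / sum(raw)) for c, r in zip(child_set, raw)}
    return table

wX1 = random_weights([], OBS)
wX2 = random_weights([OBS], OBS)
wY1 = random_weights([OBS], S)
wY2 = random_weights([OBS, S], S)
wY3 = random_weights([S], S)
wX3 = random_weights([S], OBS)
wY4 = random_weights([S, S], S)

def forward(x1, x2, x3):
    """Evaluate the tropicalized marginal rank by rank."""
    A0 = {y1: wY1[(x1,)][y1] + wX1[()][x1] + wX2[(x1,)][x2] for y1 in S}
    A1 = {(y2, y3): min(A0[y1] + wY2[(x2, y1)][y2] + wY3[(y1,)][y3]
                        + wX3[(y2,)][x3] for y1 in S)
          for y2, y3 in product(S, S)}
    A2 = {y4: min(A1[y2, y3] + wY4[(y2, y3)][y4] for y2, y3 in product(S, S))
          for y4 in S}
    return min(A2.values())

def brute_force(x1, x2, x3):
    """Minimize the total weight over all hidden state sequences."""
    return min(wX1[()][x1] + wX2[(x1,)][x2] + wY1[(x1,)][y1]
               + wY2[(x2, y1)][y2] + wY3[(y1,)][y3]
               + wX3[(y2,)][x3] + wY4[(y2, y3)][y4]
               for y1, y2, y3, y4 in product(S, repeat=4))
```

Recording, for each array entry, which state of the previous rank attained the minimum would additionally allow the explanations to be traced back; that step is omitted here to keep the sketch short.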

Finally, note that in a Bayesian network with a non-graded structure, the inference algorithm given by the evaluation of the tropicalized expression generally has a more complex bookkeeping structure for the computation of the expression (22), since it must resort to values of hidden variables with arbitrarily small semi-rank (Fig. 3). The corresponding data structure (array) holding these values will be rather intricate and meander-shaped.

## References

• [1] D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, Cambridge (2012).
• [2] R. E. Bellman, Dynamic Programming, Dover Publications, Mineola NY (2003).
• [3] G. F. Cooper, The computational complexity of probabilistic inference using Bayesian belief networks, Artificial Intelligence, 42 (1990), 393-405. http://dx.doi.org/10.1016/0004-3702(90)90060-D
• [4] P. Dagum, M. Luby, An optimal approximation algorithm for Bayesian inference, Artificial Intelligence, 93, No. 1-2 (1997), 1-27. http://dx.doi.org/10.1016/S0004-3702(97)00013-1
• [5] A. Doucet, N. de Freitas, N. Gordon, Sequential Monte Carlo Methods in Practice, Springer, New York (2001).
• [6] J. Felsenstein, Maximum likelihood and minimum-steps methods for estimating evolutionary trees from data on discrete characters, Systematic Biology, 22, No. 3 (1973), 240-249. http://dx.doi.org/10.1093/sysbio/22.3.240
• [7] A. B. Kahn, Topological sorting of large networks, Communications of the ACM, 5, No. 11 (1962), 558-562. http://dx.doi.org/10.1145/368996.369025
• [8] D. Heckerman, A tutorial on learning with Bayesian networks, Microsoft Research, Technical Report MSR-TR-95-06 (1995). http://dx.doi.org/10.1007/978-3-540-85006-3_3
• [9] I. Rish, An empirical study of the naive Bayes classifier, IJCAI Workshop on Empirical Methods in AI (2001). http://dx.doi:10.1.330.2788
• [10] T. Müller-Gronbach, E. Novak, K. Ritter, Monte-Carlo-Algorithmen, Springer, Berlin (2012).
• [11] L. Pachter, B. Sturmfels, Algebraic Statistics for Computational Biology, Cambridge University Press, Cambridge (2005).
• [12] T. Koski, J. M. Noble, Bayesian Networks, Wiley, New York (2009).
• [13] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Francisco, CA (1990).
• [14] J. Pearl, Causality, Cambridge University Press, Cambridge (2000).
• [15] L. R. Rabiner, A tutorial on hidden Markov models and selected applications, Proceedings of the IEEE, 77, No. 2 (1989), 257-286. http://dx.doi.org/10.1109/5.18626
• [16] S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, Englewood Cliffs, NJ (1995).
• [17] R. Stanley, Enumerative Combinatorics, Cambridge University Press, Cambridge, (1997).
• [18] N. Wiberg, Codes and Decoding on General Graphs, Linköping Studies in Science and Technology, Dissertation 440, Linköpings Universitet, Linköping (1996).
• [19] J. S. Yedidia, W. T. Freeman, Y. Weiss, Constructing free-energy approximations and generalized belief propagation algorithms, IEEE Transactions on Information Theory, 51, No. 7 (2005), 2282-2312. http://dx.doi.org/10.1109/TIT.2005.850085
• [20] K.-H. Zimmermann, Algebraic Statistics, TubDok, Hamburg, Germany (2016). http://dx.doi:10.15480/882.1273
• [21] K.-H. Zimmermann, Stochastic Automata, Int. Journal Pure Applied Mathematics, 115, No. 3 (2017), 621-639. http://dx.doi:10.12732/ijpam.v115i3.1