1 Introduction
Bayesian networks are convenient graphical expressions for high dimensional probability distributions representing complex relationships between a large number of random variables. They have been employed extensively in areas such as bioinformatics ([4], [12], [17], [16]), artificial intelligence ([15], [13], [21]), diagnosis ([8], [18], [23]), and risk management ([7], [10], [22]) for the purposes of general modeling, prediction, and diagnosis.
A Bayesian network consists of a directed acyclic graph (DAG), in which nodes represent random variables, and a set of conditional probability distributions. For example, the graph in Figure 1 represents a network for 5 variables, $X_1$, $X_2$, $X_3$, $X_4$, and $X_5$, with a joint probability density which is assumed to factor into a product of conditional densities, one for each node given its parents. In general, for a Bayesian network on $n$ nodes, the joint density for the corresponding random variables $X_1, X_2, \ldots, X_n$ may be written as

(1) $f(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid pa(x_i)),$

where $pa(x_i)$ is used to denote the set of random variables corresponding to "parent" nodes for node $i$. For the example in Figure 1, each factor $f(x_i \mid pa(x_i))$ conditions on the parents shown in the graph.
Given observations of $X_1, \ldots, X_n$, our ultimate goal is to infer the connecting arrows for a corresponding $n$-node graph. There are many ways to do this rather successfully for discrete data or for Gaussian data. In the case of non-Gaussian continuous data, discretization is usually employed; however, care must be taken not to destroy the very structure one wishes to retrieve. Since the problem involves mapping entire intervals of values to discrete values, it is closely related to the "aggregation problem", in which one reduces the number of possible values taken on by a node by mapping collections of values in already discrete data to single points. (We will think of this as a "discretization of discrete data" and will still refer to it as discretization.)
For an example of a discretization masking important attributes of the underlying network structure, consider the three node DAG shown in Figure 2, where the corresponding random variables $X_1$, $X_2$, and $X_3$ each take on values in a common finite set.
Consider the discretization to random variables $Y_1$, $Y_2$, and $Y_3$, each defined by mapping the values of the corresponding $X_i$ to one of two values according to a fixed threshold.
The original network structure implies that the variables represented by nodes 2 and 3 are conditionally independent given the value of the variable at node 1, yet it is easy to verify that this conditional independence fails to hold after discretization. Thus, there is little hope of recovering the correct network structure from data that have been discretized in this way.
Friedman and Goldszmidt [11] suggest an approach based on the minimum description length principle that chooses a discretization which preserves the information in the original data set. However, their approach, which we describe in Section 3, requires that we compute and compare scores for every different possible discretization of the data. The number of possibilities is often prohibitively large.
In this paper we give a search strategy in which only a very small subset of minimum description length scores must be computed. Under certain simplifying assumptions, we prove that the procedure will identify the optimal score. Furthermore, in the case where there is a “correct” discretization in the sense that distributions of nodes are unchanged over entire intervals of values for children, we show that the optimal scoring network does in fact correspond to the data generating network.
2 Likelihood-Based Recovery of Bayesian Networks for Discrete Data
We will use this section to establish some notation that will be used throughout this paper and to briefly review commonly used likelihood-based methods for recovering the structure of Bayesian networks.
Consider a Bayesian network on $n$ nodes. For $i = 1, 2, \ldots, n$, let $X_i$ denote the random variable associated with node $i$. We will refer to node $i$ and the associated random variable $X_i$ interchangeably. Suppose that $X_i$ can take on values in $\{1, 2, \ldots, r_i\}$. Let $pa(i)$ denote the set of parent node indices for node $i$, let $s_i$ denote the number of parents for node $i$, and let $q_i = \prod_{j \in pa(i)} r_j$ denote the number of possible parent configurations.
For $i = 1, 2, \ldots, n$, $j = 1, 2, \ldots, q_i$, and $k = 1, 2, \ldots, r_i$, define parameters

$\theta_{ijk} = P\big(X_i = k \mid pa(X_i) = j\big).$

Here, "$pa(X_i) = j$" refers to a specific set of values for the parent variables, configurations of which have been enumerated and are now indexed by $j = 1, 2, \ldots, q_i$.
Now, from the joint density (1), the likelihood for the entire $n$-dimensional data set of size $N$ may be written as

$L(\theta) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \prod_{k=1}^{r_i} \theta_{ijk}^{N_{ijk}},$

where $N_{ijk}$ is the total number of times in the sample that $X_i$ is observed to have value $k$ when its parents take on configuration $j$. (To keep this concise expression for the likelihood, we define, for a parentless node $i$, $q_i = 1$ and $N_{i1k} = \#\{X_i = k\}$.)
There are many ways to recover networks from data. Indeed, we may not even want to think in terms of "one best network" and instead use a model averaging approach or one that constructs a best network by combining best "features" (for example edges) from several networks. In this paper, we will restrict our attention to simple methods for recovering a single "best" network as measured by various standard likelihood and information criterion indices. We will assume a manageable number of networks to score. Of course, the number of possible networks increases superexponentially in the number of nodes; the results of this paper might then be applied using Monte Carlo search strategies.

Log-Likelihood (LL)
Given $N$ tuples of data points, we compute the log-likelihood for every possible graph. For each DAG, we have a different set of relevant
parameters $\theta_{ijk}$. Given a particular DAG, we estimate each $\theta_{ijk}$
with its maximum likelihood estimator

$\hat{\theta}_{ijk} = \frac{N_{ijk}}{\sum_{k=1}^{r_i} N_{ijk}}$

and then we report the log-likelihood

$LL = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \hat{\theta}_{ijk}.$

As we can always increase the log-likelihood by including additional parameters, we expect the greatest log-likelihoods ("most likely models") to coincide with DAGs with a maximal number of edges. Therefore, it is important to include a penalty for overparameterized models. The two most common penalized log-likelihood statistics are given by the following information criteria.
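As a concrete illustration (our own, not from the paper), the counts $N_{ijk}$ and the maximum likelihood estimates can be computed directly from a small data set. The sketch below handles the contribution of a single child node in a hypothetical two-node network $1 \to 2$, with made-up data:

```python
import math
from collections import Counter

# Hypothetical data for a two-node network 1 -> 2: (x1, x2) pairs.
data = [(1, 1), (1, 2), (1, 1), (2, 2), (2, 2), (2, 1), (1, 1), (2, 2)]

# N_jk: number of times node 2 takes value k while its parent takes value j.
counts = Counter(data)
parent_totals = Counter(x1 for x1, _ in data)

# Maximum likelihood estimates: theta_jk = N_jk / N_j.
theta = {(j, k): n / parent_totals[j] for (j, k), n in counts.items()}

# Node 2's contribution to the log-likelihood: sum of N_jk * log(theta_jk).
ll_node2 = sum(n * math.log(theta[jk]) for jk, n in counts.items())
```

The full log-likelihood would add the analogous marginal term for the parentless node 1.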

Akaike’s Information Criterion (AIC)
Akaike’s Information Criterion (AIC) is, in its most general form, defined by

$AIC = -2 \, LL + 2K,$

where $K$ denotes the total number of estimated parameters. Clearly, the goal is to minimize the AIC to ensure a good fitting model in the sense of maximizing the log-likelihood while penalizing for having too many parameters.

Bayesian Information Criterion (BIC)
The Bayesian Information Criterion (BIC) is defined by

$BIC = -2 \, LL + K \log N,$

where $N$ is, as before, the sample size. As with the AIC, the goal is to minimize the BIC.
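In the standard $-2 \log$-likelihood convention (the paper's own scaling may differ by a constant factor, which does not affect the rankings of candidate DAGs), the two criteria can be sketched as:

```python
import math

def aic(loglik, num_params):
    # Akaike's Information Criterion: -2 * log-likelihood + 2 * #parameters.
    return -2.0 * loglik + 2.0 * num_params

def bic(loglik, num_params, sample_size):
    # Bayesian Information Criterion: -2 * log-likelihood + #parameters * log N.
    return -2.0 * loglik + num_params * math.log(sample_size)

# For a fixed data set, both criteria are minimized over candidate DAGs.
scores = {"aic": aic(-4.5, 3), "bic": bic(-4.5, 3, 100)}
```

For any sample size above $e^2 \approx 7.4$, the BIC penalty per parameter exceeds the AIC penalty, which is why BIC tends to select sparser graphs.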
Both the AIC and BIC have rigorous justifications, from both Bayesian and frequentist points of view. AIC was derived from information theory, though it can be thought of as Bayesian if one uses a clever choice of prior. On the other hand, BIC, originally derived through Bayesian statistics as an approximation to the Bayes factor, can also be derived as a non-Bayesian result. For more information on these widely used scoring criteria, we refer the interested reader to [1], [2], [3], and [5] (AIC), and [14] and [20] (BIC). To make some broad generalizations, AIC can often overfit the model in terms of number of parameters. BIC, on the other hand, tends to overpenalize, or underfit, the model. That being said, we have been most successful in recovering the correct structure of simulated Bayesian networks using AIC. Incidentally, the use of Bayesian networks does not necessarily imply a Bayesian approach to modeling where parameters are considered as random variables. In this paper, we are taking the frequentist approach.
3 Minimum Description Length
Arguably, the point of statistical modeling is to find regularities in an observed data set. Discovered regularities allow the modeler to be able to describe the data more succinctly. The minimum description length (MDL) principle, introduced in 1978 by Jorma Rissanen [19], is a model selection technique that chooses, as the best model, the one that permits the shortest encoding of both the model and the observed data.
In this section, we describe how Friedman and Goldszmidt [11] define a description length that can be used for network recovery from discrete data. It is the approximate length of storage space, measured in bits, for the binary representation of the DAG (network structure), the parameters (the $\theta_{ijk}$) that, together with the DAG, define a Bayesian network, and the data itself. We will see, in the end, that it is simply another penalized likelihood approach.
In what follows, we repeatedly use the fact that a positive integer $k$ can be encoded in approximately $\log_2 k$ bits. For example, the decimal value $9$ becomes, in binary, the $4$ bit number 1001. Throughout this paper, we will use $\log$ to denote the base $2$ logarithm, though, in the end, the base is unimportant for the comparisons we will make.
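As a quick check (our own, not in the paper), Python's binary formatting confirms the bit count for $9$:

```python
import math

k = 9
binary = format(k, "b")      # binary representation of 9: '1001'
bits_used = len(binary)      # 4 bits
approx_bits = math.log2(k)   # about 3.17, which rounds up to the 4 bits used
```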
Encoding the DAG
As in Section 2, we assume that $X_i$, $i = 1, 2, \ldots, n$, can take on $r_i$ possible values, and that they are integers ranging from $1$ to $r_i$. (For example, if $X_i$ can take on three distinct values, we would relabel the values as $1$, $2$, and $3$.)
The approximate number of bits needed to encode the number of nodes/variables $n$, and the number of possible values taken on by each of those variables, is

(2) $\log n + \sum_{i=1}^{n} \log r_i.$

In order to completely describe the network structure, we must also include the number of parents for each node and the actual list of parents for each node. As a simplification, since the number of parents of node $i$, which is denoted by $s_i$, is always less than $n$, we can, conservatively, reserve $\log n$ bits to encode $s_i$. Doing this for each node, we add

(3) $n \log n$

to our network description length.
For the actual list of parents, since the maximum value in the list of indices is $n$, we will use the conservative value of $\log n$ bits to encode each index. For node $i$, we must encode $s_i$ different indices, each using a length of $\log n$ bits of space. So, in total, to encode all parent lists, we will use

(4) $\sum_{i=1}^{n} s_i \log n$

bits.
In total, we use

(5) $\log n + \sum_{i=1}^{n} \big( \log r_i + \log n + s_i \log n \big)$

bits to encode the DAG structure.
Encoding the Parameters
The Bayesian network consists of a DAG together with a collection of parameters $\theta_{ijk}$. Since $\sum_{k=1}^{r_i} \theta_{ijk} = 1$, we only need to encode $q_i (r_i - 1)$ parameters for node $i$. However, the parameters are not integer valued. In our case, for network recovery, we will actually be storing/encoding parameters that have been estimated from our $N$ $n$-dimensional data points. Friedman and Goldszmidt [11] indicate that the “usual choice in the literature” is to use $(1/2) \log N$ bits per parameter. Thus, we will use

(6) $\frac{\log N}{2} \sum_{i=1}^{n} q_i (r_i - 1)$

bits to encode the estimated network parameters.
We add (5) and (6) in order to define a description length for the Bayesian network. Since we are trying to infer the best connecting arrows for a graph on $n$ nodes, we presumably already know the number of nodes and would not be encoding it. Thus, we drop the $\log n$ term and define the network description length as

(7) $DL_{net} = \sum_{i=1}^{n} \left( \log r_i + \log n + s_i \log n + \frac{\log N}{2} \, q_i (r_i - 1) \right).$
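The network description length (7) is straightforward to compute. The sketch below (function and variable names are our own) assumes the conservative $\log n$ bits per parent count and per parent-list entry described above:

```python
import math

def network_description_length(r, parents, n_samples):
    """Bits to encode a Bayesian network: value counts, parent counts,
    parent lists, and (1/2) log N bits per estimated parameter.
    r: list of value counts r_i; parents: list of parent index lists."""
    n = len(r)
    dl = 0.0
    for i in range(n):
        s_i = len(parents[i])
        q_i = 1
        for j in parents[i]:
            q_i *= r[j]                                    # parent configurations
        dl += math.log2(r[i])                              # encode r_i
        dl += math.log2(n)                                 # encode s_i
        dl += s_i * math.log2(n)                           # encode the parent list
        dl += 0.5 * math.log2(n_samples) * q_i * (r[i] - 1)  # parameters
    return dl
```

For a two-node binary network $1 \to 2$ with $N = 16$, this gives $4 + 7 = 11$ bits.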
Encoding the Data
Consider a string of decimal digits such as 2, 1, 1, 1 stored, in binary, as 10111. (The binary representations of 1, 2, and 3 are 1, 10, and 11, respectively.) Without separators, the binary string cannot be decoded near the end. The final 111 may be, in decimal, three 1’s, a 1 followed by a 3, or a 3 followed by a 1. “Prefix codes” avoid this problem by encoding the digits with zeros and ones in such a way that no encoded digit is a prefix of any other encoded digit. Further consideration can be made to ensure the code is not only a prefix code, but one that results in the maximum compression of the data by assigning shorter binary codes to the decimal digits that appear more frequently in the original data.
In the context of Bayesian networks, we wish to encode the values in the $n$-dimensional data set with code lengths that decrease with the frequencies (equivalently, estimated probabilities) with which they appear in the data. Friedman and Goldszmidt [11] use Shannon coding [6], which is not optimal in terms of compression, but which encodes each $n$-dimensional data point $\vec{x}$ using approximately $-\log \hat{f}(\vec{x})$ bits, where $\hat{f}$ is the estimated joint density for $(X_1, X_2, \ldots, X_n)$ evaluated at $\vec{x}$. Thus, the entire data set, consisting of $N$ such vectors, is encoded in approximately

(8) $DL_{data} = -\sum_{h=1}^{N} \log \hat{f}(\vec{x}_h)$

bits. This is desirable from a modeling standpoint since it is the negative of the familiar log-likelihood commonly used in statistical inference.
Summing (7) and (8), we define the description length for a Bayesian network and the observed data as

(9) $DL = DL_{net} + DL_{data}.$

Given discrete data and a collection of possible Bayesian networks, the network chosen by the MDL principle is the one that minimizes (9). In light of the form of (8), we see that this is simply a penalized log-likelihood scoring metric, similar to the AIC and BIC discussed in Section 2. We now illustrate the performance of all three for a three node network. The list of all 25 DAGs corresponding to 3 node networks can be found in Table 1, and we will refer to these DAGs as they are numbered there.
Table 1: The 25 possible DAGs on three nodes, numbered 1 through 25.
We fixed arbitrary parameters for DAG 8 in Table 1 and simulated $N$ values of $(X_1, X_2, X_3)$. (Each $X_i$ was assumed to take on values in the same finite set.)
Results are shown in Table 2 along with the previously computed values of AIC and BIC. The AIC, BIC, and MDL scores all recovered the correct network “up to Markov equivalence”.^{†}
^{†} DAGs over the same set of nodes are said to be Markov equivalent if they share the same set of conditional independence relations among variables. Recovering the specific graph within a Markov equivalence class (a causality problem) requires experimental as opposed to simply observational data and is not the subject of this paper.
Graph Number  AIC  BIC  MDL
1  1040886.085  1041057.317  520443.714 
2  1001501.922  1001958.543  500885.425 
3  1001501.961  1001958.543  500885.425 
4  991255.256  991711.879  495762.093 
5  991255.259  991711.879  495762.093 
6  1036182.434  1036639.054  518225.681 
7  1036182.434  1036639.054  518225.681 
8  951871.096  952613.104  476213.804 
9  951871.096  952613.104  476213.804 
10  951871.096  952613.104  476213.804 
11  996798.271  997540.279  498677.392 
12  996798.271  997540.279  498677.392 
13  996798.271  997540.279  498677.392 
14  991435.745  936604.692  496536.904 
15  986551.608  987293.616  493554.060 
16  986551.608  987293.616  493554.060 
17  986551.608  987293.616  493554.060 
18  956755.233  958924.180  479196.648 
19  952051.582  954505.917  476988.615 
20  952051.582  954505.917  476988.615 
21  952051.582  954505.917  476988.615 
22  952051.582  954505.917  476988.615 
23  952051.582  954505.917  476988.615 
24  952051.582  954505.917  476988.615 
25  952051.582  954505.917  476988.615 
4 Minimum Description Length for Discretization
A discretization of the data for the random variable $X_i$, represented by node $i$, is a mapping from the range of values in the data set to the set $\{1, 2, \ldots, k\}$ for some $k$. It can be described by ordering the distinct values observed for $X_i$ and inserting “thresholds” between them. For example, with $9$ ordered distinct observed values $v_{(1)} < v_{(2)} < \cdots < v_{(9)}$, one possible discretization is

(10) $v_{(1)} \;\; v_{(2)} \;\big|\; v_{(3)} \;\; v_{(4)} \;\; v_{(5)} \;\big|\; v_{(6)} \;\; v_{(7)} \;\; v_{(8)} \;\; v_{(9)},$

in which values to the left of the first threshold are mapped to $1$, values between the two thresholds are mapped to $2$, and values to the right of the second threshold are mapped to $3$.
In this paper, we will assume that only one node, say node $i$, needs to be discretized and that the remaining nodes are discrete or have already been discretized. We refer the reader to Friedman and Goldszmidt [11] for a discussion of multiple node discretization, which essentially involves fixing discretizations for all but one node, discretizing that node, and repeating the process by cycling through all nodes repeatedly.
As before, we assume that we have $N$ observations of the $n$-dimensional $(X_1, X_2, \ldots, X_n)$. Let $F_i$ be the number of distinct values taken on by $X_i$ in the data set. Note that $F_i \leq N$, with equality possible only for truly continuous data. Define $X_i^{*}$ as the discretized version of $X_i$. To discretize to $k$ values, we need to choose $k - 1$ thresholds to put in the $F_i - 1$ spaces between ordered values. There are $\binom{F_i - 1}{k - 1}$ threshold configurations to consider. Since we will not know in advance how many values we should have in the discretized data set, we need to consider everything from $k = 1$, which corresponds to mapping all values for $X_i$ in the data set to the single value of $1$, to $k = F_i$, which corresponds to no discretization at all. In total, there are

$\sum_{k=1}^{F_i} \binom{F_i - 1}{k - 1} = 2^{F_i - 1}$

discretizations to consider.
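The count $2^{F_i - 1}$ follows from the binomial theorem; a quick numeric check (our own, using $F_i = 9$ as in the later example):

```python
import math

F = 9  # number of distinct observed values for the node

# Sum, over all possible numbers of discretized values k = 1, ..., F,
# of the number of ways to place k - 1 thresholds in the F - 1 gaps.
total_policies = sum(math.comb(F - 1, k - 1) for k in range(1, F + 1))
```

Even for modest $F_i$ this count explodes: $F_i = 9$ already gives 256 policies, and $F_i = 31$ gives over a billion.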
Friedman and Goldszmidt [11] define a description length score for a network with $X_i$ discretized into a particular configuration of thresholds using essentially four terms. These terms include $DL_{net}$ and $DL_{data}$, previously described in (7) and (8), computed now after discretization; we will call these terms $DL_{net}^{*}$ and $DL_{data}^{*}$. Also included are a term that encodes the index denoting a particular discretization policy and a description length for the information needed to recover the original data from the discretized data.
Encoding the Discretization Policy
For fixed $F_i$ and $k$, there are $\binom{F_i - 1}{k - 1}$ different possible configurations for the thresholds. Assume we have labeled them from $1$ to $\binom{F_i - 1}{k - 1}$. Storing the index for a particular policy will take at most $\log \binom{F_i - 1}{k - 1}$ bits. Friedman and Goldszmidt use a conservative upper bound based on the inequality

$\log \binom{m}{\ell} \leq m \, H(\ell / m),$

where

$H(p) = -p \log p - (1 - p) \log (1 - p),$

and conservatively reserve

(11) $(F_i - 1) \, H\!\left( \frac{k - 1}{F_i - 1} \right)$

bits to encode the discretization policy. (In practice, we define $0 \log 0 = 0$.)
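The bound in (11) is easy to verify numerically. Below is a sketch (function names are our own) comparing the exact index length $\log \binom{F_i - 1}{k - 1}$ with the conservative entropy bound:

```python
import math

def binary_entropy(p):
    # H(p) = -p log2 p - (1 - p) log2 (1 - p), with 0 log 0 taken as 0.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

F, k = 9, 3  # two thresholds placed among the F - 1 = 8 gaps

exact = math.log2(math.comb(F - 1, k - 1))              # log2 C(8, 2) = log2 28
bound = (F - 1) * binary_entropy((k - 1) / (F - 1))     # 8 * H(1/4)
```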
Encoding Recovery of Original Data
Consider again the example of 9 ($F_i = 9$) distinct values for $X_i$ given by (10), from a data set with $N$ values. Every time we assign a discretized value to an original value, we should store the original value for recovery. Instead, however, we will store the Shannon binary code for the original value using estimated conditional probabilities based on the entire data set. For example, the third smallest distinct value, $v_{(3)}$, might appear in the entire data set one-fourth of the time. That is, $\hat{P}(X_i = v_{(3)}) = 1/4$. Among the instances of $v_{(3)}$, $v_{(4)}$, and $v_{(5)}$, the values mapped to the discretized value $2$, it might appear half of the time. That is, $\hat{P}(X_i = v_{(3)} \mid X_i^{*} = 2) = 1/2$. Given a discretized value of $X_i^{*}$, we will encode the original value $X_i$ using approximately $-\log \hat{P}(X_i \mid X_i^{*})$ bits. In total, for recovering original data from discretized data, we add

(12) $-\sum_{h=1}^{N} \log \hat{P}\big(X_i = x_{ih} \mid X_i^{*} = x_{ih}^{*}\big)$

to the description length.
In summary, Friedman and Goldszmidt [11] define the description length discretization score as

(13) $DL^{*} = DL_{net}^{*} + DL_{data}^{*} + (F_i - 1) \, H\!\left( \frac{k - 1}{F_i - 1} \right) - \sum_{h=1}^{N} \log \hat{P}\big(X_i = x_{ih} \mid X_i^{*} = x_{ih}^{*}\big).$

Given a data set and a particular network structure, one scores various discretization mappings for $X_i$ and chooses the discretization that minimizes (13). For a given network, the score in (13) will change over discretizations only in terms directly linked to the $i$th node. Thus, as Friedman and Goldszmidt point out, we only need to consider the “local description length score” defined, by picking out the relevant terms, as

(14) $DL_{local} = (F_i - 1) H\!\left( \frac{k - 1}{F_i - 1} \right) + \frac{\log N}{2} K^{*} + N \hat{H}(X_i \mid X_i^{*}) + N \hat{H}(X_i^{*}) - N \left[ \hat{I}\big(X_i^{*}; \mathbf{pa}(X_i^{*})\big) + \sum_{j : \, i \in pa(j)} \hat{I}\big(X_j; \mathbf{pa}^{*}(X_j)\big) \right],$

where $K^{*}$ is the number of parameters associated with node $i$ and with the children of node $i$ after discretization. Here, $\mathbf{pa}^{*}(X_j)$ is the set of parents for node $j$, denoted with an asterisk since it includes the discretized $X_i^{*}$, and

(15) $\hat{I}(X; Y) = \sum_{x, y} \hat{P}(x, y) \log \frac{\hat{P}(x, y)}{\hat{P}(x) \hat{P}(y)}$

is the estimated mutual information between random vectors $X$ and $Y$. ($\hat{I}(X; \emptyset)$ is defined to be zero.)
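The estimated mutual information (15) can be computed directly from a joint frequency table; a minimal sketch (our own implementation, using the natural logarithm):

```python
import math

def mutual_information(joint):
    """Estimated mutual information from a joint probability table,
    where joint[j][k] is the estimated P(X = j, Y = k)."""
    row = [sum(r) for r in joint]          # marginal of X
    col = [sum(c) for c in zip(*joint)]    # marginal of Y
    mi = 0.0
    for j, r in enumerate(joint):
        for k, p in enumerate(r):
            if p > 0:                      # 0 log 0 is taken as 0
                mi += p * math.log(p / (row[j] * col[k]))
    return mi
```

The value is zero for an independent table and is maximized when one variable determines the other.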
5 Minimizing the Local Description Length Score
Computation of a single value of (14), which depends on the network structure through the information terms, can be quite time consuming. For even moderately sized $F_i$, computing (14) repeatedly to check all $2^{F_i - 1}$ discretization policies can be prohibitive, and when multiple/all nodes need to be discretized (a procedure that involves several passes through each node, as discussed in [11]), computation of (14) becomes almost impossible. Friedman and Goldszmidt [11] give some further computational simplifications and suggest a greedy search routine.
The purpose of this paper is to provide an alternative efficient search strategy for the smallest score. We assert that, for a single node discretization, one need only check $F_i - 1$ values of (14) as opposed to $2^{F_i - 1}$. Multiple continuous nodes can then be cycled for discretization just as Friedman and Goldszmidt have suggested.
For notational simplicity, we assume, for the remainder of this paper, that the node to be discretized is labeled as node $1$. Also, as we are comparing values of (14) for various discretizations of $X_1$, we will drop all asterisk superscript notation, as it is understood that we are considering discretized values.
Single Threshold Top-Down Search Strategy
Let $DL_{full}$ be (14) with all $F_1 - 1$ thresholds in place. For $t = 1, 2, \ldots, F_1 - 1$, let $DL_{-t}$ be (14) with all thresholds except for the $t$th threshold.
In order to minimize (14) over all discretization policies for $X_1$, make the $F_1 - 1$ comparisons of $DL_{full}$ with $DL_{-t}$ for $t = 1, 2, \ldots, F_1 - 1$. If $DL_{-t} \leq DL_{full}$, remove the $t$th threshold.
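The strategy can be sketched as follows, where `score` stands in for an evaluation of the local description length (14) on a given set of retained thresholds (the function and all names here are our own illustration):

```python
def top_down_search(score, num_thresholds):
    """Single threshold top-down strategy: compare the score of the full
    discretization with each single-threshold removal, and drop every
    threshold whose removal does not increase the score.
    `score` maps a set of retained threshold indices to a value of (14)."""
    full = frozenset(range(1, num_thresholds + 1))
    base = score(full)
    # Keep threshold t only if removing it (alone) makes the score worse.
    keep = {t for t in full if score(full - {t}) > base}
    return frozenset(keep)
```

With a toy score that rewards retaining thresholds 2 and 5 (and mildly penalizes every retained threshold), the search keeps exactly those two, using only `num_thresholds` score evaluations beyond the baseline.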
We first consider the effectiveness of this search strategy in the very ideal situation where we augment given discrete data by introducing superfluous values for node 1 in a larger discrete set. For example, we might replace values of $X_1 = 1$ in the original data set with values in $\{1, 2\}$ with some arbitrary probabilities, whereupon we hope that our search strategy will minimize (14) and that the configuration of thresholds that does so will correctly map values in $\{1, 2\}$ back to the original value of $1$. We will prove, in this case, that the “single threshold top-down” search strategy will find the discretization policy that minimizes (14) among all discretizations for a large enough sample size $N$.
Indeed, $N$, $F_1$, $\hat{H}(X_1)$, and the marginal entropies of the children of node 1 are constant over discretization policies, and $N \hat{H}(X_1 \mid X_1^{*}) + N \hat{H}(X_1^{*}) = N \hat{H}(X_1)$ since $X_1^{*}$ is a function of $X_1$. Note that

(16) $(F_1 - 1) \, H\!\left( \frac{k - 1}{F_1 - 1} \right) + \frac{\log N}{2} K^{*},$

where $K^{*}$ is the number of parameters involving node 1 after discretization, will be constant over any discretization policy for $X_1$ with a fixed number of thresholds. In particular, when comparing all possible single threshold removals, starting with any fixed number of thresholds, we can restrict our attention to maximizing

(17) $\hat{I}\big(X_1^{*}; \mathbf{pa}(X_1^{*})\big) + \sum_{j : \, 1 \in pa(j)} \hat{I}\big(X_j; \mathbf{pa}^{*}(X_j)\big).$

In the next section, we consider what our search strategy does to (17) in the absence of (16) in an ideal situation where there is a “correct” discretization. We will see that it maximizes (17) and that it does so while leaving a minimal number of thresholds. In Section 5.2, we will consider (16) and conclude that we are indeed minimizing (14). We will also see that it finds the correct discretization.
5.1 A Closer Look at Information Terms in an Ideal Situation
As the notation to follow gets a bit cumbersome in the general case, we illustrate the proof of most claims in this section and the next with a concrete example. Consider a two-node network where node $1$ is a parent to node $2$, and assume that both nodes take on values in $\{1, 2, 3\}$. From $N$ data points, we can produce estimates

$\hat{p}_{jk} = \frac{\#\{X_1 = j, X_2 = k\}}{N}$

for $j, k \in \{1, 2, 3\}$.
In general, for this two node network, we would assume that node 1 takes on values in $\{1, \ldots, r_1\}$, node 2 takes on values in $\{1, \ldots, r_2\}$, and one would work with the estimates $\hat{p}_{jk}$, $\hat{p}_{j \cdot}$, and $\hat{p}_{\cdot k}$ for $j = 1, \ldots, r_1$ and $k = 1, \ldots, r_2$.
We now “explode” the data for our specific example at node $1$ into values in $\{1, 2, \ldots, 6\}$ by replacing instances of $X_1 = 1$ with values in $\{1, 2\}$ with probabilities $\alpha_1$ and $\alpha_2$ (summing to $1$), respectively, replacing original instances of $X_1 = 2$ with values in $\{3, 4, 5\}$ with probabilities $\beta_1$, $\beta_2$, and $\beta_3$ (summing to $1$), respectively, and replacing original instances of $X_1 = 3$ with the value $6$. Thus, in our “exploded data set”, node $1$ is taking on values in $\{1, 2, \ldots, 6\}$ with probabilities denoted as $\tilde{p}_{j \cdot}$ for $j = 1, \ldots, 6$, where, for example, $\tilde{p}_{1 \cdot} = \alpha_1 \hat{p}_{1 \cdot}$ and $\tilde{p}_{3 \cdot} = \beta_1 \hat{p}_{2 \cdot}$. Joint probabilities for $X_1$ and $X_2$ are denoted by $\tilde{p}_{jk}$, where we have, for example,

$\tilde{p}_{1k} = \alpha_1 \hat{p}_{1k} \qquad \text{and} \qquad \tilde{p}_{3k} = \beta_1 \hat{p}_{2k}.$

In the more general case, we could “explode” the data at node 1 by replacing instances of $X_1 = 1$ with values in $\{1, \ldots, m_1\}$ with some probabilities $\alpha_1, \ldots, \alpha_{m_1}$ summing to 1, replacing original instances of $X_1 = 2$ with values in $\{m_1 + 1, \ldots, m_1 + m_2\}$ with respective probabilities $\beta_1, \ldots, \beta_{m_2}$ summing to 1, and so forth.
By “exploding” the data at node one, we have introduced superfluous values for $X_1$ in the sense that the conditional probabilities for $X_2$ are unchanged. For example, $\tilde{p}_{1k} / \tilde{p}_{1 \cdot} = \tilde{p}_{2k} / \tilde{p}_{2 \cdot}$ for all $k$. Any proper discretization process should aggregate the values $1$ and $2$ for $X_1$ back into one value. Here, $\tilde{p}_{j \cdot}$ is used to denote the probability $P(X_1 = j)$ in the exploded data. Similarly, $P(X_2 = k)$ may be denoted as $\tilde{p}_{\cdot k}$, which equals $\hat{p}_{\cdot k}$ since the explosion does not change the distribution of $X_2$.
For this two-node network, (17) is simply the estimated information between the exploded $X_1$ and $X_2$, which we will denote by $\tilde{I}$. Define this estimated information as

(18) $\tilde{I} = \sum_{j} \sum_{k} \tilde{I}_{jk},$

where

(19) $\tilde{I}_{jk} = \tilde{p}_{jk} \log \frac{\tilde{p}_{jk}}{\tilde{p}_{j \cdot} \, \tilde{p}_{\cdot k}},$

and the sums run over $j = 1, \ldots, 6$ and $k = 1, 2, 3$.
Define the corresponding information $\hat{I}$ and information terms $\hat{I}_{jk}$ using $\hat{p}$ in place of $\tilde{p}$, with sums running over $j = 1, 2, 3$ and $k = 1, 2, 3$.
It is easy to verify information term relationships such as

$\tilde{I}_{1k} + \tilde{I}_{2k} = \hat{I}_{1k},$

and consequently that

$\tilde{I} = \hat{I}.$

That is, we did not change the information between $X_1$ and $X_2$ by exploding the data at $X_1$.
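This invariance is easy to check numerically. The sketch below (with made-up probabilities of our own) explodes each row of a $3 \times 3$ joint table into scalar multiples, exactly as in the $\alpha$, $\beta$ construction above, and confirms that the mutual information is unchanged:

```python
import math

def mi(joint):
    # Estimated mutual information from a joint table (natural log).
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    return sum(p * math.log(p / (row[j] * col[k]))
               for j, r in enumerate(joint) for k, p in enumerate(r) if p > 0)

# A hypothetical 3x3 joint distribution for (X1, X2).
p = [[0.20, 0.05, 0.05],
     [0.05, 0.20, 0.05],
     [0.05, 0.05, 0.30]]

# Explode X1 = 1 into two values with probabilities 0.4 / 0.6, X1 = 2 into
# three values with probabilities 0.5 / 0.3 / 0.2, and leave X1 = 3 alone.
# Each new row is a scalar multiple of an original row, so the conditional
# distribution of X2 given X1 is unchanged.
p_exploded = ([[0.4 * v for v in p[0]], [0.6 * v for v in p[0]]]
              + [[w * v for v in p[1]] for w in (0.5, 0.3, 0.2)]
              + [p[2]])
```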
We now show that our single threshold top-down search strategy will correctly recover the original values for $X_1$. We call this the “correct discretization”, denoted as $\Lambda_c = (2, 5)$, which means that we will remove three thresholds from the “full discretization”, denoted as $\Lambda_f = (1, 2, 3, 4, 5)$, and that the values $1$ and $2$ will map back to $1$, the values in $\{3, 4, 5\}$ will map back to $2$, and the value $6$ will map back to $3$. We will assume that the original values in $\{1, 2, 3\}$ are all distinct in the sense that $\hat{p}_{1k}/\hat{p}_{1 \cdot} \neq \hat{p}_{2k}/\hat{p}_{2 \cdot}$ for some $k$, $\hat{p}_{1k}/\hat{p}_{1 \cdot} \neq \hat{p}_{3k}/\hat{p}_{3 \cdot}$ for some $k$, and $\hat{p}_{2k}/\hat{p}_{2 \cdot} \neq \hat{p}_{3k}/\hat{p}_{3 \cdot}$ for some $k$, so that they should not be aggregated further.
Removing a Threshold: The Impact on Information
Starting with the exploded data, with values for $X_1$ represented as $1, 2, \ldots, 6$, we consider the information term after removing the threshold between the values $t$ and $t + 1$ for some $t \in \{1, 2, 3, 4, 5\}$. Note that removal of this $t$th threshold will leave all values below the threshold unchanged, while all values above will be decreased by $1$. For example, if we remove the third threshold, we denote the new configuration as $\Lambda = (1, 2, 4, 5)$, but it represents the mapping

$1 \mapsto 1, \quad 2 \mapsto 2, \quad 3 \mapsto 3, \quad 4 \mapsto 3, \quad 5 \mapsto 4, \quad 6 \mapsto 5.$

Define, for $j$ and $k$, the joint and marginal probabilities for $X_1$ and $X_2$ after the $t$th threshold is removed as $\tilde{p}^{\,t}_{jk}$, $\tilde{p}^{\,t}_{j \cdot}$, and $\tilde{p}^{\,t}_{\cdot k}$. We have, for $j < t$, the relationships

$\tilde{p}^{\,t}_{jk} = \tilde{p}_{jk} \qquad \text{and} \qquad \tilde{p}^{\,t}_{j \cdot} = \tilde{p}_{j \cdot},$

and

$\tilde{p}^{\,t}_{tk} = \tilde{p}_{tk} + \tilde{p}_{t+1, k}, \qquad \tilde{p}^{\,t}_{t \cdot} = \tilde{p}_{t \cdot} + \tilde{p}_{t+1, \cdot}, \qquad \tilde{p}^{\,t}_{jk} = \tilde{p}_{j+1, k} \;\; \text{for } j > t.$
Defining $\tilde{I}^{t}$ and $\tilde{I}^{t}_{jk}$ analogous to (18) and (19), using $\tilde{p}^{\,t}$, we have, for $j = t$,

$\tilde{I}^{t}_{tk} = \big(\tilde{p}_{tk} + \tilde{p}_{t+1, k}\big) \log \frac{\tilde{p}_{tk} + \tilde{p}_{t+1, k}}{\big(\tilde{p}_{t \cdot} + \tilde{p}_{t+1, \cdot}\big) \tilde{p}_{\cdot k}} \; \leq \; \tilde{I}_{tk} + \tilde{I}_{t+1, k}.$

The inequality is due to the log-sum inequality,

$\sum_{i=1}^{m} a_i \log \frac{a_i}{b_i} \; \geq \; \left( \sum_{i=1}^{m} a_i \right) \log \frac{\sum_{i=1}^{m} a_i}{\sum_{i=1}^{m} b_i},$

which holds for any nonnegative $a_1, \ldots, a_m$ and $b_1, \ldots, b_m$. The log-sum inequality can be shown to be an equality if and only if the ratios $a_i / b_i$ are equal for all $i$. Thus, we have that
(20) $\tilde{I}^{t} \leq \tilde{I}$

for $t = 1, \ldots, 5$, with equality if and only if

$\frac{\tilde{p}_{tk}}{\tilde{p}_{t \cdot} \, \tilde{p}_{\cdot k}} = \frac{\tilde{p}_{t+1, k}}{\tilde{p}_{t+1, \cdot} \, \tilde{p}_{\cdot k}} \quad \text{for all } k.$

This happens if and only if

(21) $\frac{\tilde{p}_{tk}}{\tilde{p}_{t \cdot}} = \frac{\tilde{p}_{t+1, k}}{\tilde{p}_{t+1, \cdot}}$

for $k = 1, 2, 3$, which is precisely when the values $t$ and $t + 1$ in the exploded version of $X_1$ should be aggregated or discretized into one value. If we do not have (21), aggregating the values will result in a loss of information.
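The log-sum inequality itself, and its equality condition, can be verified numerically (a standalone check with arbitrary values of our own choosing):

```python
import math

def logsum_lhs(a, b):
    # sum_i a_i * log(a_i / b_i)
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

def logsum_rhs(a, b):
    # (sum_i a_i) * log(sum_i a_i / sum_i b_i)
    return sum(a) * math.log(sum(a) / sum(b))

strict_a, strict_b = [1.0, 2.0, 3.0], [2.0, 1.0, 3.0]  # unequal ratios a_i / b_i
equal_a, equal_b = [4.0, 2.0, 6.0], [2.0, 1.0, 3.0]    # all ratios equal to 2
```

The first pair gives a strict inequality; the second pair, with all ratios $a_i / b_i$ equal, gives equality, mirroring condition (21).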
Note that $\tilde{I}^{t}$ denotes the value of the information between $X_1$ and $X_2$ with the $t$th threshold in the explosion for $X_1$ removed. In our example,

$\frac{\tilde{p}_{1k}}{\tilde{p}_{1 \cdot}} = \frac{\alpha_1 \hat{p}_{1k}}{\alpha_1 \hat{p}_{1 \cdot}} = \frac{\hat{p}_{1k}}{\hat{p}_{1 \cdot}}$

and

$\frac{\tilde{p}_{2k}}{\tilde{p}_{2 \cdot}} = \frac{\alpha_2 \hat{p}_{1k}}{\alpha_2 \hat{p}_{1 \cdot}} = \frac{\hat{p}_{1k}}{\hat{p}_{1 \cdot}}.$

Thus, we have (21) when $t = 1$, and consequently

$\tilde{I}^{1} = \tilde{I}.$

On the other hand, considering the second threshold removal from the full discretization $\Lambda_f = (1, 2, 3, 4, 5)$, similar calculations show that

$\frac{\tilde{p}_{2k}}{\tilde{p}_{2 \cdot}} = \frac{\hat{p}_{1k}}{\hat{p}_{1 \cdot}},$

which is not, in general, equal to

$\frac{\tilde{p}_{3k}}{\tilde{p}_{3 \cdot}} = \frac{\hat{p}_{2k}}{\hat{p}_{2 \cdot}},$

so we have the strict inequality in (20) and therefore

$\tilde{I}^{2} < \tilde{I}.$