    # An Efficient Search Strategy for Aggregation and Discretization of Attributes of Bayesian Networks Using Minimum Description Length

Bayesian networks are convenient graphical expressions for high dimensional probability distributions representing complex relationships between a large number of random variables. They have been employed extensively in areas such as bioinformatics, artificial intelligence, diagnosis, and risk management. The recovery of the structure of a network from data is of prime importance for the purposes of modeling, analysis, and prediction. Most recovery algorithms in the literature assume either discrete of continuous but Gaussian data. For general continuous data, discretization is usually employed but often destroys the very structure one is out to recover. Friedman and Goldszmidt suggest an approach based on the minimum description length principle that chooses a discretization which preserves the information in the original data set, however it is one which is difficult, if not impossible, to implement for even moderately sized networks. In this paper we provide an extremely efficient search strategy which allows one to use the Friedman and Goldszmidt discretization in practice.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Bayesian networks are convenient graphical expressions for high dimensional probability distributions representing complex relationships between a large number of random variables. They have been employed extensively in areas such as bioinformatics (, , , ), artificial intelligence (, , ), diagnosis (, , ), and risk management (, , ) for the purposes of general modeling, prediction, and diagnosis.

A Bayesian network consists of a directed acyclic graph (DAG), in which nodes represent random variables, and a set of conditional probability distributions. For example, the graph in Figure

1 represents a network for 5 variables, , , , , and with a joint probability density which is assumed to factor as

 p(x1,x2,x3,x4,x5)=p(x1)⋅p(x2|x1)⋅p(x3|x1)⋅p(x4|x2,x3)⋅p(x5|x4).

In general, for a Bayesian network on nodes, the joint density for the corresponding random variables may be written as

 p(x1,x2,…,xn)=n∏i=1p(xi|Πi) (1)

where is used to denote the set of random variables corresponding to “parent” nodes for node . For the example in Figure 1, we have , , , , and .

Given observations of , our ultimate goal is to infer the connecting arrows for a corresponding -node graph. There are many ways to do this rather successfully for discrete data or for Gaussian data. In the case of non-Gaussian continuous data, discretization is usually employed, however care must be taken not to destroy the very structure one wishes to retrieve. Since the problem involves mapping entire intervals of values to discrete values, it is closely related to the “aggregation problem” where one reduces the number of possible values taken on by a node by mapping collections of values in already discrete data to single points. (We will think of this as a “discretization of discrete data” and will still refer to it as discretization.)

For an example of a discretization masking important attributes of the underlying network structure, consider the three node DAG show in Figure 2 where the corresponding random variables , , and each take on values in .

Consider the discretization to random variables , , and defined as if and if .

The original network structure implies that the variables represented by nodes and are conditionally independent given the value of the variable at node 1, yet it is easy to verify that we have, after discretization,

 P(X∗2=1,X∗3=1|X∗1=1)≠P(X∗2=1|X∗1=1)⋅P(X∗3=1|X∗1=1),

for example. Thus, there is little hope of recovering the correct network structure after this discretization with data that have been discretized in this way.

Friedman and Goldszmidt  suggest an approach based on the minimum description length principle that chooses a discretization which preserves the information in the original data set. However, their approach, which we describe in Section 3, requires that we compute and compare scores for every different possible discretization of the data. The number of possibilities is often prohibitively large.

In this paper we give a search strategy in which only a very small subset of minimum description length scores must be computed. Under certain simplifying assumptions, we prove that the procedure will identify the optimal score. Furthermore, in the case where there is a “correct” discretization in the sense that distributions of nodes are unchanged over entire intervals of values for children, we show that the optimal scoring network does in fact correspond to the data generating network.

## 2 Likelihood-Based Recovery of Bayesian Networks for Discrete Data

We will use this section to establish some notation that will be used throughout this paper and to briefly review commonly used likelihood-based methods for recovering the structure of Bayesian networks.

Consider a Bayesian network on nodes. For , let denote the random variable associated with node . We will refer to node and the associated random variable , interchangeably. Suppose that can take on values in . Let denote the set of parent node indices for node , let denote the number of parents for node , and let denote the number of possible parent configurations. For example, for the network in in Figure 1,

 |Π4|=2and||Π4||=||X2||⋅||X3||.

For , , and , define parameters

 θijk:=P(Xi=k|Πi=j).

Here, “” refers to a specific set of values for the parent variables, configurations of which have been enumerated and are now indexed by .

Now, from the joint density (1), the likelihood for the entire dimensional data set may be written as

 L(θ):=∏i,j,kθnijkijk

where is the total number of times in the sample that is observed to have value when its parents take on configuration . (To keep this concise expression for the likelihood, we define, for a parentless node , and .)

There are many ways to recover networks from data. Indeed, we may not even want to think in terms of “one best network” and instead use a model averaging approach or one that constructs a best network by combining best “features” (for example edges) from several networks. In this paper, we will restrict our attention to simple methods for recovering a single “best” network as measured by various standard likelihood and information criterion indices. We will assume a manageable number of networks to score. Of course, the number of possible networks increases superexponentially in the number of nodes– the results of this paper might then be applied using Monte Carlo search strategies.

1. Log-Likelihood (LL)

Given -tuples of data points, we compute the log-likelihood for every possible graph. For each DAG, we have a different set of relevant

parameters. Given a particular DAG, we estimate each

with its maximum likelihood estimator

 ^θijk=\# observations with Xi=k and Πi=j\# observations with Πi=j,

and then we report the log-likelihood

 lnL(^θ)=∑i,j,knijkln(^θijk).

As we can always increase the log-likelihood by including additional parameters, we expect the greatest log-likelihoods (“most likely models”) to coincide with DAGs with a maximal number of edges. Therefore, it is important include a penalty for overparameterized models. The two most common penalized log-likelihood statistics are given by the following information criteria.

2. Akaike’s Information Criterion (AIC)

Akaike’s Information Criterion (AIC) is, in its most general form, defined by

 AIC=−2lnL+2⋅(#parameters).

Clearly, the goal is to minimize the AIC to ensure a good fitting model in the sense of maximizing the log-likelihood while penalizing for having too many parameters.

3. Bayesian Information Criterion (BIC)

The Bayesian Information Criterion (BIC) is defined by

 BIC=−2lnL+(#parameters)⋅ln(m),

where is, as before, the sample size. As with the AIC, the goal is to minimize the BIC.

Both the AIC and BIC have rigorous justifications, from both Bayesian and frequentist points of view. AIC was derived from information theory though it can be though of as Bayesian if one uses a clever choice of prior. On the other hand BIC, originally derived through Bayesian statistics as a measure of the Bayes factor, can also be derived as a non-Bayesian result. For more information on these widely used scoring criteria, we refer the interested reader to

,, , and  (AIC), and  and  (BIC). To make some broad generalizations, AIC, it can often overfit the model in terms of number of parameters. BIC, on the other hand, tends to overpenalize, or underfit the model.That being said, we have been most successful in recovering the correct structure of simulated Bayesian networks using AIC. Incidentally, the use of Bayesian networks does not necessarily imply a Bayesian approach to modeling where parameters are considered as random variables. In this paper, we are taking the frequentist approach.

## 3 Minimum Description Length

Arguably, the point of statistical modeling is to find regularities in an observed data set. Discovered regularities allow the modeler to be able to describe the data more succinctly. The minimum description length (MDL) principle, introduced in 1978 by Jorma Rissanen , is a model selection technique that chooses, as the best model, the one that permits the shortest encoding of both the model and the observed data.

In this section, we describe how Friedman and Goldszmidt  define a description length that can be used for network recovery from discrete data. It is the approximate length of storage space, measured in bits, for the binary representation of the DAG (network structure), the parameters (the ) that, together with the DAG, define a Bayesian network, and the data itself. We will see, in the end, that it is simply another penalized likelihood approach.

In what follows, we repeatedly use the fact that an integer can be encoded in approximately bits. For example, the decimal value becomes, in binary, the bit number 1001. Throughout this paper, we will use to denote the base logarithm, though, in the end the base is unimportant for the comparisons we will make.

### Encoding the DAG

As in Section 2, we assume that , can take on possible values, and that they are integers ranging from to . (For example, if can take on values in , we would relabel the values as , , and .)

The approximate number of bits needed to encode the number of nodes/variables , and the number of possible values taken on by each of those variables is

 logn+n∑i=1log||Xi||. (2)

In order to completely describe the network structure, we must also include the number of parents for each node and the actual list of parents for each node. As a simplification, since the number of parents of node , which is denote by , is always less than , we can, conservatively, reserve bits to encode . Doing this for each node, we add

 n∑i=1logn=nlogn (3)

to our network description length.

For the actual list of parents, since the maximum value in the list of indices is , we will use the conservative value of bits to encode each index. For node , we must encode different indices, each using a length of bits of space. So, in total, to encode all parent lists, we will use

 n∑i=1|Πi|⋅logn (4)

bits.

In total, we use

 logn+n∑i=1log||Xi||+n∑i=1(1+|Πi|)⋅logn (5)

bits to encode the DAG structure.

### Encoding the Parameters

The Bayesian network consists of a DAG together with a collection of parameters . Since , we only need to encode parameters for node . However, the parameters are not integer valued. In our case, for network recovery, we will actually be storing/encoding parameters that have been estimated from our -dimensional data points. Friedman and Goldszmidt  indicate that the “usual choice in the literature” is to use bits per parameter. Thus, we will use

 12logm⋅n∑i=1||Πi||⋅(||Xi||−1) (6)

bits to encode the estimated network parameters.

We add (5) and (6) in order to define description length for the Bayesian network. Since we are trying to infer the best connecting arrows for a graph on nodes, we presumably already know the number of nodes and would not be encoding it. Thus, we drop the term and define the network description length as

 DLnet=n∑i=1log||Xi||+n∑i=1(1+|Πi|)⋅logn+12logm⋅n∑i=1||Πi||⋅(||Xi||−1). (7)

### Encoding the Data

Consider the string of digits stored, in binary as . (The binary representations of , , and are , , and , respectively.) Without separators, the binary string can not be decoded near the end. The maybe be, in decimal, three ’s, a followed by a , or a followed by a . “Prefix codes” avoid this problem by encoding the digits with zeros and ones in a way so that no encoded digit is a prefix of any other encoded digit. Further consideration can be made to ensure the code is not only a prefix code, but one that results in the maximum compression of the data by assigning, to the original decimal digits, binary codes of lengths inversely proportional to the frequencies with which they appear in the original data.

In the context of Bayesian networks, we wish to encode the values in the -dimensional data set with code lengths that are inversely proportional to the frequencies (equivalently, estimated probabilities) with which they appear in the data. Friedman and Goldszmidt  use Shannon coding  which is not optimal in terms of compression, but which encodes each -dimensional data point using approximately bits where is the joint density for evaluated at . Thus, the entire data set, consisting of

such vectors is encoded in approximately

 DLdata=−m∑i=1logp(→xi) (8)

bits. This is desirable from a modeling standpoint since it corresponds to the familiar log-likelihood commonly used in statistical inference.

Summing (7) and (8), we define the description length for a Bayesian network and the observed data as

 DL=n∑i=1log||Xi||+n∑i=1(1+|Πi|)⋅logn+12logm⋅n∑i=1||Πi||⋅(||Xi||−1)−m∑i=1logp(→xi). (9)

Given discrete data and a collection of possible Bayesian networks, the network chosen by the MDL principle is the one that minimizes (9). In light of the form of , we see that this is simply a penalized log-likelihood scoring metric, similar to the AIC and BIC discussed in Section 2. We now illustrate the performance of all three for a three node network. The list of all 25 DAGs corresponding to 3 node networks can be found in Table 1 and we will refer to these DAGs as they are numbered here.

We fixed arbitrary parameters for DAG 8 in Table 1 and simulated values of . (Each was assumed to take on values in .)

Results are shown in Table 2 along with the previously computed values of AIC and BIC. The AIC, BIC, and MDL scores all recovered the correct network “up to Markov equivalence”.A set of DAGs with the same set of edges are said to be Markov equivalent if they share the same set of conditional independence relations among variables. Recovering the specific graph within a Markov equivalence class (a causality problem) requires experimental as opposed to simply observed data and is not the subject of this paper.

## 4 Minimum Description Length for Discretization

A discretization of data for the random variable , represented by node , is a mapping from the range of values in the data set to the set for some . It can be described by ordering the distinct values observed for and inserting up to “thresholds”. For example

 0.380.420.530.71map % to 1|1.371.94 2.10% map to 2|5.387.11map to 3. (10)

In this paper, we will assume that only node needs to be discretized and that the remaining nodes are discrete or have already been discretized. We refer the reader to Friedman and Goldszmidt  for a discussion of multiple node discretization which essentially involves fixing discretizations for all but one node, discretizing that node, and repeating the process by cycling through all nodes repeatedly.

As before, we assume that we have observations of the -dimensional . Let be the number of distinct values taken on by in the data set. Note that , with equality possible only for truly continuous data. Define as the discretized version of . To discretize to values, we need to choose thresholds to put in spaces between ordered values. There are threshold configurations to consider. Since we will not know in advance how many values we should have in the discretized data set, we need to consider everything from , which corresponds to mapping all values for in the data set to the single value of 1, to , which corresponds to no discretization at all. In total, there are

 (mi−10)+(mi−11)+…(mi−1mi−1)=2mi−1

discretizations to consider.

Friedman and Goldszmidt  define a description length score for a network with discretized into a particular configuration of thresholds using essentially four terms. These terms include and , previously described in (7) and (8), computed now after discretization– we will call these terms and . Also included are terms that encode the index denoting a particular discretization policy and description length for information needed to recover the original data from the discretized data.

### Encoding the Discretization Policy

For fixed and , there are different possible configurations for thresholds. Assume we have labeled them from to . Storing the index for a particular policy will take at most bits. Friedman and Goldszmidt use a conservative upper bound based on the inequality

 (nk)≤2nH(k/n),

where

 H(p):=−plogp−(1−p)log(1−p),

and conservatively reserve

 DLDP=(mi−1)H(ki−1mi−1) (11)

to encode the discretization policy. (In practice, we define .)

### Encoding Recovery of Original Data

Consider again the example of 9 () distinct values for given by (10) from a data set with values. Every time we assign a discretized value to an original value, we should store the original value for recovery. Instead, however, we will store the Shannon binary code for the original value using estimated conditional probabilities based on the entire data set. For example, the value might appear in the entire data set one-fourth of the time. That is, . Among the instances of , , and , it might appear half of the time. That is, . Given a discretized value of , we will encode the original value, , using approximately bits. In total, for recovering original data from discretized data, we add

 DLrec=−m∑i=1logˆP(Xi|X∗i) (12)

to the description length.

In summary, Friedman and Goldszmidt  define the description length discretization score as

 DL∗=(mi−1)H(ki−1mi−1)+DL∗net+DL∗data+DLrec. (13)

Given a data set and a particular network structure, one scores various discretization mappings for and chooses the discretization that minimizes (13). For a given network, the score in (13) will change over discretizations only in terms directly linked to the th node. Thus, as Friedman and Goldszmidt point out, we only need to consider the “local description length score” defined, by picking out relevant terms, as

 (14)

Here, is a the set of parents for node , denoted with an asterisk since it includes the discretized , and,

 ˆI(→X,→Y)=∑→x,→yˆP(→X=→x,→Y=→y)⋅log(ˆP(→X=→x,→Y=→y)ˆP(→X=→x)ˆP(→Y=→y)) (15)

is the estimated mutual information between random vectors and . ( is defined to be zero.)

## 5 Minimizing the Local Description Length Score

Computation of a single value of (14), which is based on network structure through the information terms, can be quite time consuming. For even moderately sized , computation of (14) repeatedly to check all discretization policies can be prohibitive, and when multiple/all nodes need to be discretized (a procedure that involves several passes through each node as discussed in ), computation of (14) becomes almost impossible. Friedman and Goldszmidt  give some further computational simplifications and suggest a greedy search routine.

The purpose of this paper is to provide an alternative efficient search strategy for the smallest score. We assert that, for a single node discretization, one need only check values of as opposed to . Multiple continuous nodes can then be cycled for discretization just as Friedman and Goldszmidt have suggested.

For notational simplicity, we assume, for the remainder of this paper, that the node to be discretized is labeled as node . Also, as we are comparing values of for various discretizations of , we will drop all asterisk superscript notation, as it is understood that we are considering discretized values.

### Single Threshold Top-Down Search Strategy

Let be (14) with all thresholds in place. For , let be (14) with all thresholds except for the th threshold.

In order to minimize over all discretization policies for , make comparisons of with for . If , remove the th threshold.

We first consider the effectiveness of this search strategy in the very ideal situation where we augment given discrete data by introducing superfluous values for node 1 in a larger discrete set. For example, we might replace values of in the original data set with values in with some arbitrary probabilities, whereupon we hope that our search strategy will minimize and that the configuration of thresholds that does such will correctly map values in back to the original value of . We will prove, in this case, that the “single threshold top-down” search strategy will find the discretization policy that minimizes among all discretizations for a large enough sample size .

Indeed, , , , and are constant, and . Note that

 (m1−1)H(k1−1m1−1)+logk1+12logm⎡⎣||Π1||(k1−1)+∑j:X1∈Πj||Πj||(||Xj||−1)⎤⎦ (16)

will be constant over any discretization policy for with a fixed number of thresholds. In particular, when comparing all possible single threshold removals, starting with any fixed number of thresholds, we can restrict our attention to maximizing

 ˆI(X1,Π1)+∑j:X1∈ΠjˆI(Xj,Πj). (17)

In the next Section, we consider what our search strategy does to (17) in the absence of (16) in an ideal situation where there is a “correct” discretization. We will see that it maximizes (17) and that it does so while leaving a minimal number of thresholds. In Section 5.2, we will consider (16) and conclude that we are indeed minimizing . We will also see that it finds the correct discretization.

### 5.1 A Closer look at Information Terms in an Ideal Situation

As the notation to follow gets a bit cumbersome in the general case, we illustrate the proof of most claims in this Section and the next with a concrete example. Consider a two-node network where node is a parent to node , and assume that that both nodes take on values in . From data points, we can produce estimates

 ˆp(i,j):=ˆP(X1=i,X2=j)ˆp1(i):=ˆP(X1=i),andˆp2(j):=ˆP(X2=j),

for .

In general, for this two node network, we would assume that node 1 takes on values in , node 2 takes on values in , and one would work with the estimates , , and for and .

We now “explode” the data for our specific example at node into values in by replacing instances of with values in with probabilities and , respectively, replacing original instances of with values in with probabilities , , and , respectively, and replacing original instances of with the value . Thus, in our “exploded data set”, node is taking on values in with probabilities denoted as for , where, for example, and . Joint probabilities for and are denoted by where we have, for example,

 ˜p(1,j)=13ˆp(1,j)and˜p(4,j)=47ˆp(2,j).

In the more general case, we could “explode” the data at node more generally by replacing instances of with values in some probabilities , summing to 1, replacing original instances of with values in with respective probabilities , summing to 1, and so forth.

By “exploding” the data at node one, we have introduced superfluous values for in terms of probabilities for . For example, for all . Any proper discretization process should aggregate the values and for back into one value. Here, is used to denote the probability . Similarly, may be denoted as .

For this two-node network, (17) is simply , which we will denote by . Define this estimated information as

 ˆI:=∑i,jˆIi,j (18)

where

 ˆIij:=ˆp(i,j)⋅log(ˆp(i,j)ˆp1(i)⋅ˆp2(j)) (19)

and the sums run over and .

Define the corresponding information and information terms using in place of with sums running over and .

It is easy to verify information term relationships such as

 ˜I1,j=13ˆI1,jand˜I4,j=47⋅ˆI2,j,

and consequently that

 ˜I=∑i∈{1,2,…6}j∈{1,2,3}˜Iij=∑i∈{1,2,3}j∈{1,2,3}ˆIij=ˆI.

That is, we did not change the information between and by exploding the data at .

We now show that our single threshold top-down search strategy will correctly recover the original values for . We call this the “correct discretization”, denoted as , which means that we will remove three thresholds from the “full discretization”, denoted as , and that the values and will map back to , the values in will map back to , and the value , will map back to . We will assume that the original values in are all distinct in the sense that for some , for some , and for some , so that they should not be aggregated further.

### Removing a Threshold: The Impact on Information

Starting with the exploded data, with values for represented as , we consider the information term after removing the threshold between the values and for some . Note that removal of this th threshold will leave all values below the threshold unchanged, while all values above will be decreased by 1. For example, if we remove the third threshold, we denote the new configuration as , but it represents a mapping

 1map to 1|2map to 2|34map to 3|5map to 4|6map to 5.

Define, for and , the joint and marginal probabilities for and after the th threshold is removed as , , and . We have, for , the relationships

 p(r)(i,j)=˜p(i,j),i∈{1,2,…,r−1}(r>1)p(r)(r,j)=˜p(r,j)+˜p(r+1,j)p(r)(i,j)=˜p(i+1,j),i∈{r+1,r+2,…,6},(r<5)p(r)1(i)=˜p1(i),i∈{1,2,…,r−1}p(r)1(r)=˜p1(r)+˜p1(r+1)p(r)1(i)=˜p1(i+1),i∈{r+1,r+3,…,6},(r<5),

and

 p(r)2(j)=˜p2(j)=ˆp2(j).

Defining and analogous to (18) and (19), using , we have, for ,

 I(r)rj=p(r)(r,j)⋅log(p(r)(r,j)p(r)1(r)p(r)2(j))=[˜p(r,j)+˜p(r+1,j)]⋅log(˜p(r,j)+˜p(r+1,j)[˜p1(r)+˜p1(r+1)]⋅˜p2(j))≤˜p(r,j)⋅log(˜p(r,j)˜p1(r)⋅˜p2(j))+˜p(r+1,j)]⋅log(˜p(r+1,j)˜p1(r+1)⋅˜p2(j))=˜Irj+˜Ir+1,j.

The inequality is is due to the log-sum inequality,

 n∑i=1ailog(aibi)≥[∑iai]log(∑iai∑ibi),

which holds for any nonnegative and . The log-sum inequality can be shown to be an equality if and only if the are equal for all . Thus, we have that

 I(r)rj≤˜Irj+˜Ir+1,j (20)

for , with equality if and only if

 ˜p(r,j)˜p1(r)⋅˜p2(j)=˜p(r+1,j)˜p1(r+1)⋅˜p2(j).

This happens if and only if

 ˜P(X2=j|X1=r)=˜P(X2=j|X1=r+1). (21)

for , which is precisely when the values and in the exploded version of should be aggregated or discretized into one value. If we do not have (21), aggregating the values will result in a loss of information.

Note that denotes the value of the information between and with the th threshold in the explosion for removed. In our example,

 ˜P(X2=j|X1=1)=˜p(1,j)˜p1(1)=(1/3)ˆp(1,j)(1/3)ˆp1(1)=ˆP(X2=j|X1=1)

and

 ˜P(X2=j|X1=2)=˜p(2,j)˜p2(1)=(2/3)ˆp(1,j)(2/3)ˆp1(1)=ˆP(X2=j|X1=1).

Thus, we have (21), when , and consequently

 I(1)=∑5i=1∑3j=1I(1)ij=∑3j=1I(1)1j+∑5i=2∑3j=1I(1)ij=∑3j=1I(1)1j+∑6i=3∑3j=1˜Iij=∑3j=1(˜I1j+˜I2j)+∑6i=3∑3j=1˜Iij=∑6i=1˜Iij=˜I.

On the other hand, considering the second threshold removal from the full discretization , similar calculations show that

 ˜P(X2=j|X1=2)=ˆP(X2=j|X1=1)

which is not, in general, equal to

 ˜P(X2=j|X1=3)=ˆP(X2=j|X1=2),

so we have the strict inequality and therefore

 I(2)=∑5i=1∑3j=1I(2)ij=∑3j=1I(2)1j+∑3j=1I(2)2j+∑5i=3∑3j=1I(2)ij=∑3j=1˜I1j+∑3j=1I(2)2j+∑6i=4∑3j=1˜Iij<∑3j=1˜I