Graphical models have a wide range of applications in various fields, such as signal processing, coding theory, and bioinformatics. Inferring the graphical model from data is the starting point in such applications.
In this paper, we learn a probabilistic relation among variables from a data frame that consists of samples.
The conditional independence relations can generally be expressed by a Bayesian network. However, estimating the optimal structure given the data frame is difficult because the number of candidate directed acyclic graphs grows more than exponentially in the number of variables. In this paper, we assume that the underlying model is a forest rather than a general Bayesian network.
The forest learning problem has a long history and has been investigated by several authors. The basic method was considered by Chow and Liu: connect each pair of variables (vertices) by an edge, as long as no loop is generated by the connection, in descending order of mutual information to obtain a forest. The algorithm assumes that the mutual information values are known, and the resulting tree expresses an approximation of the original distribution. However, we may start from a data frame and connect the edges based on estimates of the mutual information values. Recently, Liu et al. estimated the density via kernel methods to prove consistency for continuous variables in a high-dimensional setting, and Tan et al. restricted the number of edges to prove consistency for discrete variables in a different high-dimensional setting.
The approach that we take in this paper is essentially different. We note that by adding an edge to a forest without creating any loop, the complexity of the forest increases while the likelihood of the distribution that the forest expresses improves. In this sense, the balance should be considered to obtain a correct forest. In 1993, the author considered a modified estimation of mutual information that takes this balance into account and applied it to the Chow-Liu algorithm. The resulting undirected graph is not a tree but a forest, and the estimation satisfies consistency, avoiding overfitting because it minimizes the description length rather than maximizes the likelihood. Our estimation in this paper is Bayesian and slightly different from that work, although the two are essentially equivalent. Specifically, we find a model that maximizes the posterior probability given the data frame when some of the values are missing [11, 8].
In general, it is computationally difficult to obtain the Bayes optimal solution in model selection with incomplete data. In fact, suppose that the variables are binary and that some of the values are missing. Then, we need to compute Bayes scores for each candidate model, where the score is defined as the prior probability of the model multiplied by the conditional probability of the data frame given the model, because the missing values must be marginalized out.
An alternative approach, which both reduces the computational effort and obtains a correct model as the sample size increases, is to select a model based only on the samples in which all the values are available. This method ensures consistency, i.e., a correct model is obtained for a large sample size if the number of such complete samples also grows and if an appropriate model selection method is applied to them. However, this method discards any sample in which at least one value is missing, and it therefore fails to obtain the Bayes optimal model. In this paper, we assume that the underlying model is a forest rather than a general graphical model to overcome these problems.
The remainder of this paper is organized as follows. In Section 2, assuming that no value is missing in a given data frame, we consider a Bayes optimal mutual information estimator to avoid such overfitting. In Section 3, we construct a model selection procedure that maximizes the posterior probability given a data frame that may contain missing values. The computation is at most polynomial in the number of variables. A surprising result is that the model that maximizes the posterior probability does not necessarily converge to a correct model as the sample size increases for an incomplete data frame. Moreover, we illustrate the theoretical results by showing experiments using the Alarm and Insurance data sets as benchmarks. In Section 4, we evaluate the code length of each data frame and the expected redundancy per sample when some values may be missing, where redundancy is defined as the difference between the expected compression ratio and the entropy for a given pair of coding and source. Section 5 summarizes the results and presents directions for future work.
II Forest Learning from Complete Data
We assume that each random variable takes a finite number of values. By a forest, we mean an undirected graph without any loops. If the vertex set and the edge set are given by $V$ and a subset $E$ of $V \times V$, respectively, then we have a distribution of the form
Moreover, if we specify the marginal probabilities of the vertices and the conditional probabilities along the edges, then the distribution (1) is uniquely determined. For example, the distributions correspond to the undirected graph in Figure 1 (c), where the vertices and edges correspond to the variables and the dependencies between them, respectively.
First, suppose that the distribution is known. We consider minimizing the Kullback-Leibler divergence incurred by approximating the true distribution by a distribution of the form (1):
where the first two terms are the entropies of the individual variables and the last involves the mutual information of each connected pair. To minimize the divergence, because the first two terms in (2) are constants that do not depend on the edge set, it suffices to maximize the sum of mutual information values over the edges. For this purpose, we apply Kruskal's algorithm, which, given symmetric non-negative weights, obtains a spanning tree (a connected forest) such that the weight sum is maximized: let the edge set be empty and all pairs be candidates at the beginning, and continue to
add the candidate pair with the largest weight as an edge if connecting the two vertices does not cause any loop to be generated, and
remove that pair from the candidate set (irrespective of whether the pair was connected)
until the candidate set is empty. The Chow-Liu algorithm (1968) uses the mutual information values as the weights to minimize the divergence. For example, in Figure 2, the pairs with the two largest mutual information values are connected at the beginning, but the pair with the third largest value will not be connected because connecting vertices 2 and 3 would generate a loop. The pair with the fourth largest mutual information value is then connected because no loop is generated. The procedure terminates at this point because a loop would be generated if any additional pair of vertices were connected.
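The maximum-weight variant of Kruskal's algorithm described above can be sketched as follows. This is a minimal illustration (the function name and the dictionary-of-weights representation are our own), not the BNSL implementation used later in the paper.

```python
# Kruskal-style maximum-weight spanning tree over variable pairs,
# as used by the Chow-Liu algorithm.

def kruskal_max(w):
    """Return the edges of a maximum-weight spanning tree.

    w: dict mapping pairs (i, j) to symmetric non-negative weights.
    """
    parent = {}

    def find(x):                # union-find root with path compression
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    # visit candidate pairs in descending order of weight
    for (i, j) in sorted(w, key=w.get, reverse=True):
        ri, rj = find(i), find(j)
        if ri != rj:            # connecting i and j creates no loop
            parent[ri] = rj
            edges.append((i, j))
    return edges
```

For the four-vertex example above, the pair whose connection would close a loop is skipped even though its weight ranks third.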
Next, we consider the case in which no distribution but only a data frame is given. In this section, we assume that no value is missing in the data frame.
A naive approach to generate a forest is to estimate a mutual information value by the quantity
from the occurrences of the value pairs in the samples and to plug it into the Chow-Liu algorithm. Although this estimator converges to the true mutual information as the sample size increases, it is always positive; thus, the Chow-Liu algorithm always generates a spanning tree. For example, two variables will be connected for any finite sample size even if they are independent. This is because maximum likelihood may overfit, and in such cases no consistency is obtained.
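The naive plug-in estimator can be sketched as follows for discrete samples; the function name is our own. It is the mutual information of the empirical joint distribution, hence non-negative by construction, which is why the Chow-Liu algorithm driven by it always returns a spanning tree.

```python
from collections import Counter
from math import log

def mi_ml(xs, ys):
    """Maximum-likelihood (plug-in) mutual information of paired samples."""
    n = len(xs)
    cx, cy, cxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # mutual information of the empirical joint; non-negative by construction
    return sum((nxy / n) * log(n * nxy / (cx[x] * cy[y]))
               for (x, y), nxy in cxy.items())
```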
In 1993, Suzuki proposed replacing this quantity with another estimator of mutual information
up to lower-order terms, where the quantity, which is termed a Bayes measure and is defined later in this section, is computed from the samples and satisfies
This estimator is similar to the naive one in the sense that the two asymptotically coincide, but it also has the property that it takes a negative value for a large sample size if and only if the two variables are independent. Moreover, given the prior probability of independence, deciding dependence if and only if
maximizes the posterior probability of the decision, which is equivalent to
when the two hypotheses are equally probable a priori. We can show that the decision (6) is correct for a large sample size, which can be stated as follows:
We shall now define a Bayes measure. For a random variable that takes zero and one, let $\theta$ be the probability of one. Then, the probability that an independent sequence with $k$ ones and $n-k$ zeros occurs can be written as $\theta^{k}(1-\theta)^{n-k}$. If a prior density of $\theta$ is available, then by integrating over $\theta$, we can compute the measure of the sequence without assuming any specific $\theta$:
If the weight is in the form
with parameter $a>0$, then we obtain the Bayes measure as
where $\Gamma(\cdot)$ is the Gamma function. For example, for a particular choice of $a$ and the counts, we have
We define the corresponding quantities for the other variables and for pairs similarly, assuming that each variable and each pair take values in finite sets; the computation for the binary case extends to alphabets of arbitrary finite size by replacing the constants and occurrence counts accordingly. Thus, the extended formula can be expressed by [7, 14]
For completeness, we show the proofs of the derivations from (5) to (4) and Proposition 1 in Appendices A and B, respectively.
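As a concrete sketch, a Bayes measure of this kind can be computed with the Gamma function as follows. The symmetric Dirichlet weight $a=1/2$ (the Krichevsky-Trofimov choice) and the function name are our assumptions, chosen to match the Gamma-function form above.

```python
from collections import Counter
from math import lgamma

def log_bayes_measure(seq, m, a=0.5):
    """Log of the Bayes measure Q(seq) for an alphabet {0, ..., m-1}
    under a symmetric Dirichlet(a, ..., a) prior."""
    n, counts = len(seq), Counter(seq)
    # closed form obtained by integrating the parameter out
    logq = lgamma(m * a) - lgamma(n + m * a)
    for v in range(m):
        logq += lgamma(counts.get(v, 0) + a) - lgamma(a)
    return logq
```

For a single binary symbol, the measure assigns probability 1/2 to each value, and the measures of length-2 extensions sum to the measure of their prefix, as required of a measure over sequences.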
Example 1 (Experiment)
We show box plots that depict the realizations of the two estimators when the true mutual information is zero and when it is positive, and we find that the maximum-likelihood estimator is always larger than the Bayesian one, which is due to its overfitting (Figure 3). Specifically, the maximum-likelihood estimator cannot detect independence because its value always exceeds zero.
In contrast, Kruskal’s algorithm works even when some weights are either zero or negative: a pair is connected only if the weight is positive. If we apply mutual information values based on (5) rather than (3) to the Chow-Liu algorithm, then it is possible that the value of (5) is negative, which means that the two variables are independent:
A pair of nodes need not be connected even when connecting them does not cause any loops to be generated.
Proof of Proposition 2: Kruskal’s algorithm connects as an edge only nodes with a positive weight.
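Under Proposition 2, the modification to Kruskal's algorithm is small: stop once the remaining weights are non-positive, so the result is a forest rather than a spanning tree. A minimal sketch (the names and the dictionary-of-weights layout are our own):

```python
# Kruskal variant that yields a forest: a pair is connected only if its
# (possibly negative) estimated mutual information is positive.

def kruskal_forest(w):
    parent = {}

    def find(x):                # union-find root with path compression
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    edges = []
    for (i, j) in sorted(w, key=w.get, reverse=True):
        if w[(i, j)] <= 0:      # all remaining weights are non-positive
            break
        ri, rj = find(i), find(j)
        if ri != rj:            # connecting i and j creates no loop
            parent[ri] = rj
            edges.append((i, j))
    return edges
```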
Hence, the Chow-Liu algorithm based on (5) may terminate before causing overfitting. Moreover, via (1), sequentially choosing an edge with the maximum (5) is equivalent to choosing a forest with the maximum
In fact, the first term on the right-hand side of
is constant irrespective of the forest, and minimizing the description length and maximizing the sum of the values (5) over the edges are equivalent. Therefore, if we assume the uniform prior over the forests, then the Chow-Liu algorithm based on (5) chooses a forest with the maximum posterior probability given the samples. If the prior probability is given by
Moreover, consistency also holds. In fact, because the two estimators asymptotically coincide, the orders in which they rank the candidate pairs asymptotically coincide, and from (6), the timing at which the procedure terminates is asymptotically correct.
Comparing the Chow-Liu algorithms based on the maximum-likelihood and Bayesian estimators, which maximize the likelihood and the posterior probability, respectively, we find that
the Bayesian version does not necessarily generate a spanning tree but a forest, and
the edge set generated by the Bayesian estimator is not necessarily a subset of the spanning tree generated by the maximum-likelihood estimator.
III Forest Learning from Incomplete Data
We consider an extension of the Chow-Liu algorithm based on (5) such that it addresses a data frame that contains missing values. Specifically, we construct quantities that generalize (5) to respectively obtain
a forest that maximizes the posterior probability given a data frame
a forest that converges to the true one as the sample size increases.
Given a data frame, for each pair of variables, we compute the required Bayes measures based on the samples in which neither of the two values is missing.
First, we arbitrarily choose a root for each connected subgraph of the forest. Let
be the Bayes measure with respect to the root, computed from its non-missing values. In particular, the root represents one of the variables, and its measure is obtained via
where the counts are the numbers of occurrences of each value among the non-missing entries, and the sum ranges over the values that the variable takes. Note that the full sample size is replaced by the number of non-missing samples because some values are missing.
Because each connected subgraph is a spanning tree, the directed path from its root to each vertex in the subgraph is unique, and a directed edge set is determined by the choice of roots. We arbitrarily fix a directed edge from the upper vertex to the lower vertex, and let
Moreover, define the Bayes measures with respect to each of these sequences in the same manner.
Suppose that we have a data frame in which some values of the two variables are missing, where "*" denotes a missing value. Then, the scores are associated with the subsequences of the observed values.
Since each factor depends only on the corresponding observed subsequence, the Bayes measure with respect to the pair can be evaluated as
This means that the Bayes measure with respect to the whole forest is evaluated as
Since maximizing (10) is equivalent to maximizing the product of the pairwise factors over the edges, to maximize the posterior probability when the prior distribution over the forests is uniform, it is sufficient to choose the edges that maximize
among the pairs, based on the samples in which neither value of the pair is missing. We find that (11) contains (5) as the special case in which no value is missing for any pair. We summarize the above discussion as follows:
If we apply in (11) to the Chow-Liu algorithm, then we obtain a forest with the maximum posterior probability.
Proof of Theorem 1: Maximizing the Bayes measure (10) is equivalent to maximizing (11) at each step of Kruskal’s algorithm. Since an optimal solution results even with a greedy choice of such pairs, we obtain a forest with the maximum posterior probability.
We find that the value of (11) coincides with the mutual information estimator
multiplied by the non-missing ratio $n_{ij}/n$, where $n_{ij}$ is the number of samples in which neither value of the pair $(i,j)$ is missing. Therefore, when the non-missing ratios differ across pairs, maximizing the mutual information estimator does not necessarily mean maximizing (11). Then, we have the following propositions:
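The pairwise weight (11) for incomplete data can be sketched as follows: the Bayes-measure-based estimator is computed from the n_ij complete pairs and multiplied by the non-missing ratio n_ij/n. All names, the use of `None` for missing values, and the Dirichlet weight 1/2 are our assumptions; this illustrates the idea, not the BNSL implementation.

```python
from collections import Counter
from math import lgamma

def pair_score(xs, ys, mx, my):
    """Score of a pair of discrete variables (alphabet sizes mx, my)
    from possibly incomplete columns, with None marking missing values."""
    n = len(xs)
    pairs = [(x, y) for x, y in zip(xs, ys)
             if x is not None and y is not None]
    n_ij = len(pairs)
    if n_ij == 0:
        return 0.0

    def logq(seq, m):           # log Bayes measure, Dirichlet(1/2) prior
        a, c = 0.5, Counter(seq)
        s = lgamma(m * a) - lgamma(len(seq) + m * a)
        return s + sum(lgamma(c.get(v, 0) + a) - lgamma(a)
                       for v in range(m))

    lx = logq([p[0] for p in pairs], mx)
    ly = logq([p[1] for p in pairs], my)
    lxy = logq([p[0] * my + p[1] for p in pairs], mx * my)  # joint symbol
    j = (lxy - lx - ly) / n_ij  # estimator from the n_ij complete samples
    return (n_ij / n) * j       # weighted by the non-missing ratio
```

A dependent pair receives a positive score even when one column has gaps, while an independent pair of complete columns receives a negative one.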
Suppose that the non-missing ratio $n_{ij}/n$ converges to a positive constant as $n\to\infty$ with probability one for each pair $(i,j)$. Then,
as $n\to\infty$ with probability one.
Suppose that the non-missing ratio $n_{ij}/n$ converges to a positive constant as $n\to\infty$ with probability one for each pair $(i,j)$. Then,
as $n\to\infty$ with probability one.
Suppose that the non-missing ratio $n_{ij}/n$ converges to a positive constant as $n\to\infty$ with probability one for each pair $(i,j)$. Then, if we apply the estimator to the Chow-Liu algorithm, the generated forest is the true one as $n\to\infty$ with probability one.
as $n\to\infty$ with probability one. The propositions consider the extended case in which some values may be missing. Since the samples are independent and identically distributed, the estimator evaluates independence based on the $n_{ij}$ non-missing samples, where $n_{ij}\to\infty$ as $n\to\infty$, such that
as $n\to\infty$ with probability one, which proves Proposition 3. Meanwhile, since the non-missing ratio converges, we have
which proves Proposition 4. Proposition 5 follows because the orders given by the two estimators asymptotically coincide (Proposition 3) and the timing at which the Chow-Liu algorithm terminates is asymptotically correct (Proposition 4).
Finally, we show that Propositions 3 and 5 do not hold for the posterior-maximizing estimator (11), which means that in model selection with respect to incomplete data, the model that maximizes the posterior probability may not be asymptotically correct.
As an extreme case, if all of the values are missing for one pair among the candidates, then even if its true mutual information is large, only the other pairs will be connected:
Maximizing the posterior probability does not imply asymptotic consistency when selecting models with incomplete data.
We prove the proposition by constructing such a case. In the following example, the values are non-missing with a positive probability:
We assume three variables such that two of them are dependent while the third is independent of the other two. Then, the true forest should contain only the edge between the dependent pair. We also find that
However, if we further assume that one of the dependent variables is missing with probability
and that no values of the other two variables are missing, then we find that the posterior-maximizing procedure asymptotically chooses an incorrect forest because
and (15) is equivalent to the condition above.
Both Chow-Liu algorithms, for complete and incomplete data, complete in time polynomial in the number of variables.
Example 3 (Experiments)
For complete data, we used the CRAN package BNSL that was developed by Joe Suzuki and Jun Kawahara. The R package consists of functions that were written using Rcpp and run 50-100 times faster than the usual R functions. For the experiments, we use the data sets Alarm and Insurance, which are often used as benchmarks for Bayesian network structure learning.
library(BNSL)
mm = mi_matrix(alarm)
edge.list = kruskal(mm)
g = graph_from_edgelist(edge.list, directed=FALSE)
plot(g, vertex.size=1)
Before execution, the BNSL package should be installed via install.packages("BNSL"). The functions mi_matrix and kruskal obtain the mutual information value matrix and the edge list produced by Kruskal's algorithm, respectively, and the last two lines output the graph using the functions graph_from_edgelist and plot in the igraph package.
For incomplete data, however, we constructed modified functions that realize the two estimators for the experiments.
Figure 4 depicts the forests for the complete data with respect to Alarm and Insurance, which contain samples for 37 and 27 variables, respectively (consult the references for the meanings of the variables numbered 1-37 and 1-27), using the functions in the BNSL package. The forests were generated in a few seconds.
Then, we generated forests with respect to Alarm for prefixes of the samples, in which the first ten variables out of 37 were missing with a given probability. We addressed those data frames using both estimators. Table I shows the entropy of the generated forests (the data set is random due to the missingness, and the resulting forest is therefore random). We can consider that the smaller the entropy, the more stable the estimation, and we observe that the estimation via the posterior-based score is less stable than that via the plain mutual information estimator for a small sample size and a large missing rate. This appears to be because the effective sample size decreases on average for the first 10 variables, while the estimate is multiplied by the non-missing ratio on average even though it is based on the smaller sample; thus, the estimation variance is rather large. However, for a large sample size and a small missing rate, the estimation via the posterior-based score is more accurate than the other.
We expected that for a large sample size, the consistent estimation is closer to the true forest than the posterior-based one. To examine this expectation, we generated forests using both estimators, and the entropy values were 1.798 and 1.195, respectively. The resulting forests are shown in Figure 5 and Table II. The consistent estimator chooses the correct edges more often than the other. We observe that the two forests almost coincide except for the edges in one subgraph. Because we added noise to the first 10 variables, it is likely that the mutual information between the eighth and ninth variables was underestimated even for a large sample size. One estimator chose the correct edges for 97.5% of the data sets, while the other chose them for 87% of the data sets.
IV Universal Coding of Incomplete Data
In this section, we consider encoding a data frame.
We claim that the entropy when no value is missing is given by
where the edge set is the one obtained by applying the Chow-Liu algorithm to the set of mutual information values. In fact, (16) is equal to the following, where