I. Introduction
Graphical models have a wide range of applications in various fields, such as signal processing, coding theory, and bioinformatics [9]. Inferring the graphical model from data is a starting point in such applications.
In this paper, we learn a probabilistic relation among variables from a data frame that consists of $n$ samples of $N$ variables.
The conditional independence relations can generally be expressed by a Bayesian network. However, estimating the optimal structure given the data frame is difficult because more than an exponential number of directed acyclic graphs with respect to the number of variables $N$
are candidates. In this paper, we assume that the underlying model is a forest rather than a Bayesian network. The forest learning problem has a long history and has been investigated by several authors. The basic method was considered by Chow and Liu [4]: connect each pair of variables (vertices) as an edge, as long as no loop is generated by the connection, in descending order of mutual information to obtain a forest. The algorithm assumes that the mutual information values are known, and the resulting tree expresses an approximation of the original distribution. However, we may start from a data frame and connect the edges based on estimations of the mutual information values. Recently, Liu et al. [10] estimated the density with kernels to prove consistency for continuous variables in a high-dimensional setting, and Tan et al. [16] restricted the number of edges to prove consistency for discrete variables in a different high-dimensional setting.
The approach that we take in this paper is essentially different. We note that by adding an edge to a forest without creating any loop, the complexity of the forest increases while the likelihood of the distribution that the forest expresses improves. In this sense, the balance should be considered to obtain a correct forest. In 1993 [12], the author considered a modified estimation of mutual information that takes this balance into account and applied it to the Chow-Liu algorithm. The resulting undirected graph is not necessarily a tree but a forest, and the estimation satisfies consistency, avoiding overfitting because it minimizes the description length rather than maximizes the likelihood. Our estimation in this paper is Bayesian and slightly different from [12], although the two are essentially equivalent. Specifically, we find a model that maximizes the posterior probability given the data frame when some of the values are missing [11, 8].
In general, it is computationally difficult to obtain the Bayes optimal solution in model selection with incomplete data. In fact, suppose that the $N$ variables are binary and that $m$ of the $nN$ values are missing. Then, we will need to compute $2^m$ Bayes scores for each candidate model, where the score is defined by the prior probability of a model multiplied by the conditional probability of the data frame given the model, because the $m$ missing values should be marginalized. An alternative approach to both reduce the computational effort and obtain a correct model as the sample size increases is to select a model based on only the samples in which all the values are available. This method ensures consistency, i.e., a correct model is obtained for large $n$ if the number of such complete samples also becomes large and if an appropriate model selection method is applied to those samples. However, this method excludes the samples in which at least one value is missing, and it eventually fails to obtain the Bayes optimal model. In this paper, we assume that the underlying model is a forest rather than a general graphical model to solve such problems.
The remainder of this paper is organized as follows. In Section II, assuming that no value is missing in a given data frame, we consider a Bayes optimal mutual information estimator that avoids overfitting. In Section III, we construct a model selection procedure that maximizes the posterior probability given a data frame that may contain missing values. The computation is at most $O(N^2n)$ for $N$ variables and $n$ samples. A surprising result is that the model that maximizes the posterior probability does not necessarily converge to a correct model as $n$ increases for an incomplete data frame. Moreover, we illustrate the theoretical results by showing experiments using the Alarm [1] and Insurance [3] data sets as benchmarks. In Section IV, we evaluate the code length of each data frame and the expected redundancy per sample when some values may be missing, where redundancy is defined as the difference between the expected compression ratio and the entropy for a given pair of coding and source. Section V summarizes the results and presents directions for future work.
II. Forest Learning from Complete Data
We assume that each random variable takes a finite number of values. By a forest, we mean an undirected graph without any loops. If the vertex and edge sets are given by $V=\{1,\dots,N\}$ and a subset $E$ of $\{\{i,j\}\mid i,j\in V,\ i\neq j\}$, respectively, then we have a distribution of the form
$$P(x_1,\dots,x_N)=\prod_{i\in V}P(x_i)\prod_{\{i,j\}\in E}\frac{P(x_i,x_j)}{P(x_i)P(x_j)}.\qquad(1)$$
Moreover, if we specify the probabilities $P(x_i)$, $i\in V$, and $P(x_i,x_j)$, $\{i,j\}\in E$, then the distribution (1) is uniquely determined. For example, although the distributions
$$P(x_1)P(x_2|x_1)P(x_3|x_2)$$
and
$$P(x_3)P(x_2|x_3)P(x_1|x_2)$$
may be expressed by directed acyclic graphs as in Figure 1 (a) and Figure 1 (b), respectively, both can be expressed by the distribution
$$P(x_1)P(x_2)P(x_3)\cdot\frac{P(x_1,x_2)}{P(x_1)P(x_2)}\cdot\frac{P(x_2,x_3)}{P(x_2)P(x_3)}$$
and the undirected graph in Figure 1 (c), where the vertices $1$, $2$, $3$ and the edges $\{1,2\}$, $\{2,3\}$ correspond to $V$ and $E$, respectively.
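The equivalence above can be checked numerically. The following is a minimal sketch (not from the paper; all names are ours) that builds a chain distribution $P(x_1)P(x_2|x_1)P(x_3|x_2)$ and verifies that it matches the factorization (1) for the undirected chain $1-2-3$.

```python
import itertools, random

random.seed(0)

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

p1 = normalize([random.random() for _ in range(2)])                        # P(x1)
p2g1 = [normalize([random.random() for _ in range(2)]) for _ in range(2)]  # P(x2|x1)
p3g2 = [normalize([random.random() for _ in range(2)]) for _ in range(2)]  # P(x3|x2)

# joint P(x1, x2, x3) defined by the directed chain
joint = {(a, b, c): p1[a] * p2g1[a][b] * p3g2[b][c]
         for a, b, c in itertools.product(range(2), repeat=3)}

def marg(idx):                      # one-variable marginal
    m = [0.0, 0.0]
    for k, v in joint.items():
        m[k[idx]] += v
    return m

def marg2(i, j):                    # two-variable marginal
    m = {}
    for k, v in joint.items():
        m[(k[i], k[j])] = m.get((k[i], k[j]), 0.0) + v
    return m

q1, q2, q3 = marg(0), marg(1), marg(2)
q12, q23 = marg2(0, 1), marg2(1, 2)

# factorization (1) over the forest with edges {1,2} and {2,3}
for (a, b, c), v in joint.items():
    forest = q1[a] * q2[b] * q3[c] \
        * (q12[(a, b)] / (q1[a] * q2[b])) * (q23[(b, c)] / (q2[b] * q3[c]))
    assert abs(forest - v) < 1e-12
print("factorization (1) matches the chain distribution")
```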
First, suppose that the distribution $P(x_1,\dots,x_N)$ of $X_1,\dots,X_N$ is known. We consider minimizing the Kullback-Leibler divergence
$$D(P\|Q):=\sum_{x_1,\dots,x_N}P(x_1,\dots,x_N)\log\frac{P(x_1,\dots,x_N)}{Q(x_1,\dots,x_N)}$$
due to approximating the true distribution $P$ by a distribution $Q$ of the form (1):
$$D(P\|Q)=-H(X_1,\dots,X_N)+\sum_{i\in V}H(X_i)-\sum_{\{i,j\}\in E}I(X_i,X_j),\qquad(2)$$
where $H(X_1,\dots,X_N)$, $H(X_i)$, and $I(X_i,X_j)$ are the entropies of $(X_1,\dots,X_N)$ and $X_i$, and the mutual information of $(X_i,X_j)$ ($i\neq j$), respectively. To minimize $D(P\|Q)$, because the first two terms in (2) are constants that do not depend on $E$, we find that maximizing the mutual information sum $\sum_{\{i,j\}\in E}I(X_i,X_j)$ is sufficient. For this purpose, we apply Kruskal's algorithm that, given symmetric nonnegative weights $w_{ij}$, $i\neq j$, obtains a spanning tree (a connected forest) such that the weight sum is maximized: let $E$ be the empty set and $U:=\{\{i,j\}\mid i\neq j\}$ at the beginning, and continue to

add a pair $\{i,j\}$ to $E$ with the largest $w_{ij}$ among $U$ if connecting them does not cause any loops to be generated, and

remove the pair $\{i,j\}$ from $U$ (irrespective of whether the pair is connected)
until $U$ is empty. The Chow-Liu algorithm (1968) [4] uses the mutual information $I(X_i,X_j)$ as the weight $w_{ij}$ to minimize $D(P\|Q)$. For example, in Figure 2, if $I(X_1,X_2)\geq I(X_1,X_3)\geq I(X_2,X_3)\geq I(X_3,X_4)$, then $\{1,2\}$ and $\{1,3\}$ are connected at the beginning, but $\{2,3\}$ will not be connected because connecting 2 and 3 causes a loop to be generated, although the pair has the third largest mutual information value. Furthermore, $\{3,4\}$ is to be connected because it has the fourth largest mutual information value and no loop will be generated. The procedure terminates at this point because a loop would be generated if any additional pair of vertices were connected.
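The selection loop above can be sketched as follows; this is an illustrative implementation (function and variable names are ours, not the paper's), with the additional stopping rule, used later in this section, that pairs with non-positive weight are never connected.

```python
def kruskal_forest(n_vertices, weights):
    """weights: dict mapping frozenset({i, j}) -> weight.
    Connects pairs in descending order of weight, skipping any pair
    that would create a loop; pairs with non-positive weight are
    skipped entirely, so the result may be a forest rather than a
    spanning tree."""
    parent = list(range(n_vertices))

    def find(v):                      # union-find root with path halving
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = []
    for pair, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        if w <= 0:                    # stop: remaining weights are non-positive
            break
        i, j = tuple(pair)
        ri, rj = find(i), find(j)
        if ri != rj:                  # no loop is generated
            parent[ri] = rj
            edges.append(pair)
    return edges

# Weights ordered as in the Figure 2 discussion (vertices renamed 0..3):
# {0,1} and {0,2} are connected, {1,2} is skipped (loop), {2,3} is connected.
w = {frozenset({0, 1}): 4.0, frozenset({0, 2}): 3.0,
     frozenset({1, 2}): 2.0, frozenset({2, 3}): 1.0}
print(kruskal_forest(4, w))  # the three frozensets {0,1}, {0,2}, {2,3}
```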
Next, we consider the case in which no distribution but only a data frame is given. In this section, we assume that no value is missing in the data frame.
A naive approach to generate a forest is to estimate each mutual information value by the quantity
$$I_n(X_i,X_j):=\sum_{x,y}\frac{c_{ij}(x,y)}{n}\log\frac{c_{ij}(x,y)/n}{(c_i(x)/n)(c_j(y)/n)}\qquad(3)$$
from the numbers of occurrences $c_i(x)$, $c_j(y)$, and $c_{ij}(x,y)$ of $x$, $y$, and $(x,y)$ in the $n$ samples and to plug $I_n(X_i,X_j)$ into the Chow-Liu algorithm. Although $I_n(X_i,X_j)$ converges to $I(X_i,X_j)$ as $n$ increases, $I_n(X_i,X_j)$ is always positive; thus, the Chow-Liu algorithm always generates a spanning tree. For example, two variables are to be connected for large $n$ even if they are independent. This is because maximum likelihood may overfit, and in such cases, eventually no consistency is obtained.
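A minimal sketch of the plug-in estimator (3) from counts (the function name is ours): the value is nonnegative by construction and, except when the empirical counts factorize exactly, strictly positive even for independent variables.

```python
import math

def plugin_mi(xs, ys):
    """Plug-in (maximum-likelihood) mutual information estimate (3)."""
    n = len(xs)
    cx, cy, cxy = {}, {}, {}
    for x, y in zip(xs, ys):
        cx[x] = cx.get(x, 0) + 1
        cy[y] = cy.get(y, 0) + 1
        cxy[(x, y)] = cxy.get((x, y), 0) + 1
    return sum(c / n * math.log((c / n) / ((cx[x] / n) * (cy[y] / n)))
               for (x, y), c in cxy.items())

# Exact empirical independence gives zero ...
print(plugin_mi([0, 0, 1, 1], [0, 1, 0, 1]))            # 0.0
# ... but typical count fluctuations make the value strictly positive,
# so the Chow-Liu algorithm based on (3) still connects the pair.
print(plugin_mi([0, 0, 1, 1, 1], [0, 1, 0, 1, 1]) > 0)  # True
```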
In 1993, Suzuki [12] proposed replacing the quantity $I_n(X_i,X_j)$ by another estimation of mutual information
$$J_n(X_i,X_j):=I_n(X_i,X_j)-\frac{(\alpha_i-1)(\alpha_j-1)}{2n}\log n,\qquad(4)$$
where $\alpha_i$ and $\alpha_j$ are the numbers of values that $X_i$ and $X_j$ take. Later, the same author [13] found that the value of (4) coincides with
$$\frac{1}{n}\log\frac{Q_n^{ij}(x^n,y^n)}{Q_n^i(x^n)Q_n^j(y^n)}\qquad(5)$$
up to $O(1/n)$ terms, where the quantity $Q_n^i(x^n)$, which is termed a Bayes measure and is defined later in this section, is computed from the $n$ samples $x^n=(x_1,\dots,x_n)$ with respect to $X_i$ ($i=1,\dots,N$) and satisfies
$$Q_n^i(x^n)>0$$
and
$$\sum_{x^n}Q_n^i(x^n)=1$$
for each $n$.
This estimator is similar to $I_n(X_i,X_j)$ in the sense that $J_n(X_i,X_j)\to I(X_i,X_j)$ as $n\to\infty$, but it also has the property that it takes a negative value for large $n$ if and only if $X_i$ and $X_j$ are independent, written as $X_i\perp\!\!\!\perp X_j$. Moreover, if the prior probability of $X_i\perp\!\!\!\perp X_j$ is $0<p<1$, then deciding $X_i\perp\!\!\!\perp X_j$ if and only if
$$p\,Q_n^i(x^n)Q_n^j(y^n)\geq(1-p)\,Q_n^{ij}(x^n,y^n)$$
maximizes the posterior probability of the decision, which is equivalent to
$$J_n(X_i,X_j)\leq 0\qquad(6)$$
when $p=1/2$. We can show that the decision (6) is correct for large $n$ (Proposition 1).
We shall now define a Bayes measure. For a random variable $X$ that takes zero and one, let $\theta$ be the probability of $X=1$. Then, the probability that an independent sequence $x^n=(x_1,\dots,x_n)$ with $c$ ones and $n-c$ zeros occurs can be written as $\theta^c(1-\theta)^{n-c}$. If a prior density $w(\theta)$ of $\theta$ is available, then by integrating over $0\leq\theta\leq1$, we can compute the measure of $x^n$ without assuming any specific $\theta$:
$$Q_n(x^n):=\int_0^1\theta^c(1-\theta)^{n-c}w(\theta)\,d\theta.$$
If the weight is in the form
$$w(\theta)\propto\theta^{a-1}(1-\theta)^{b-1}$$
with $a,b>0$, then we obtain the Bayes measure as
$$Q_n(x^n)=\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\cdot\frac{\Gamma(c+a)\Gamma(n-c+b)}{\Gamma(n+a+b)},\qquad(7)$$
where $\Gamma$ is the Gamma function $\Gamma(z)=\int_0^\infty t^{z-1}e^{-t}\,dt$. For example, suppose that $a=b=1/2$, $n=1$, and $x_1=1$. Then, we have
$$Q_1(1)=\frac{\Gamma(1)}{\Gamma(1/2)\Gamma(1/2)}\cdot\frac{\Gamma(3/2)\Gamma(1/2)}{\Gamma(2)}=\frac{1}{2}.$$
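Formula (7) is conveniently evaluated with the log-Gamma function for numerical stability. The sketch below (names are ours) also checks that the measures of all binary sequences of a given length sum to one, as a probability assignment should.

```python
import math

def bayes_measure_binary(c, n, a=0.5, b=0.5):
    """Bayes measure (7) of a binary sequence of length n with c ones,
    under the Beta(a, b) weight w(theta)."""
    log_q = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
             + math.lgamma(c + a) + math.lgamma(n - c + b)
             - math.lgamma(n + a + b))
    return math.exp(log_q)

# The measures of all 2^n sequences sum to one: there are C(n, c)
# sequences with c ones, and each has the same measure.
n = 6
total = sum(math.comb(n, c) * bayes_measure_binary(c, n) for c in range(n + 1))
print(round(total, 10))  # 1.0
```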
We define the quantities $Q_n^i(x^n)$, $Q_n^j(y^n)$, and $Q_n^{ij}(x^n,y^n)$ similarly, assuming that $X_i$, $X_j$, and $(X_i,X_j)$ take values in $\{1,\dots,\alpha_i\}$, $\{1,\dots,\alpha_j\}$, and $\{1,\dots,\alpha_i\}\times\{1,\dots,\alpha_j\}$, respectively; the computation for the binary case can be extended to those for $\alpha_i$, $\alpha_j$, and $\alpha_i\alpha_j$ values, respectively. For example, for $X_i$ with $\alpha_i$ values, the constants $a,b$ and the occurrences $c,n-c$ are replaced by $a_1,\dots,a_{\alpha_i}$ and $c_1,\dots,c_{\alpha_i}$, respectively; thus, the extended formula can be expressed by [7, 14]
$$Q_n^i(x^n)=\frac{\Gamma(\sum_x a_x)}{\prod_x\Gamma(a_x)}\cdot\frac{\prod_x\Gamma(c_x+a_x)}{\Gamma(n+\sum_x a_x)}.$$
For completeness, we show the derivation from (5) to (4) and the proof of Proposition 1 in Appendices A and B, respectively.
Example 1 (Experiment)
We show box plots that depict the realizations of $I_n(X_i,X_j)$ and $J_n(X_i,X_j)$ when the mutual information values are zero and positive, and we find that $I_n(X_i,X_j)$ is always larger than $J_n(X_i,X_j)$, which is due to its overfitting (Figure 3). Specifically, $I_n(X_i,X_j)$ cannot detect independence because its value always exceeds zero.
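A small numerical illustration of this point (the data and names are ours, not the paper's): for a nearly independent pair, the plug-in value (3) is positive while the corrected value (4) is negative, so the pair would not be connected.

```python
import math

def plugin_mi(xs, ys):
    """Plug-in mutual information estimate (3)."""
    n = len(xs)
    cx, cy, cxy = {}, {}, {}
    for x, y in zip(xs, ys):
        cx[x] = cx.get(x, 0) + 1
        cy[y] = cy.get(y, 0) + 1
        cxy[(x, y)] = cxy.get((x, y), 0) + 1
    return sum(c / n * math.log((c / n) / ((cx[x] / n) * (cy[y] / n)))
               for (x, y), c in cxy.items())

def corrected_mi(xs, ys, alpha_i=2, alpha_j=2):
    """MDL-corrected estimate (4): I_n - (alpha_i-1)(alpha_j-1) log n / (2n)."""
    n = len(xs)
    return plugin_mi(xs, ys) - (alpha_i - 1) * (alpha_j - 1) * math.log(n) / (2 * n)

xs = [0, 0, 1, 1, 1]
ys = [0, 1, 0, 1, 1]
print(plugin_mi(xs, ys) > 0, corrected_mi(xs, ys) < 0)  # True True
```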
In contrast, Kruskal's algorithm works even when some weights are either zero or negative: a pair is connected only if the weight is positive. If we apply mutual information values based on (5) rather than (3) to the Chow-Liu algorithm, then it is possible that the value of (5) is negative, which means that the two variables are regarded as independent:
Proposition 2
A pair of nodes need not be connected even when connecting them does not cause any loops to be generated.
Proof of Proposition 2: Kruskal’s algorithm connects as an edge only nodes with a positive weight.
Hence, the Chow-Liu algorithm based on (5) may terminate before causing overfitting. Moreover, via (1), sequentially choosing an edge with the maximum value of (5) is equivalent to choosing a forest with the maximum
$$Q_n(x_1^n,\dots,x_N^n\mid E):=\prod_{i\in V}Q_n^i(x_i^n)\prod_{\{i,j\}\in E}\frac{Q_n^{ij}(x_i^n,x_j^n)}{Q_n^i(x_i^n)Q_n^j(x_j^n)}.\qquad(8)$$
In fact, the first term on the right-hand side of
$$\log Q_n(x_1^n,\dots,x_N^n\mid E)=\sum_{i\in V}\log Q_n^i(x_i^n)+\sum_{\{i,j\}\in E}\log\frac{Q_n^{ij}(x_i^n,x_j^n)}{Q_n^i(x_i^n)Q_n^j(x_j^n)}\qquad(9)$$
is constant irrespective of $E$, and maximizing (9) and maximizing the sum of the values of (5) over $E$ are equivalent. Therefore, if we prepare the uniform prior over the forests, then the Chow-Liu algorithm based on (5) chooses a forest with the maximum posterior probability given the samples. If the prior probability of $E$ is given by
$$P(E)\propto\prod_{\{i,j\}\in E}\frac{p_{ij}}{1-p_{ij}},$$
where $0<p_{ij}<1$ is the prior probability that $X_i$ and $X_j$ are not independent, then we can add $\log P(E)$ to (9) and replace (5) with [13]
$$\frac{1}{n}\log\frac{p_{ij}\,Q_n^{ij}(x_i^n,x_j^n)}{(1-p_{ij})\,Q_n^i(x_i^n)Q_n^j(x_j^n)}.$$
Moreover, consistency also holds. In fact, because $J_n(X_i,X_j)\to I(X_i,X_j)$ as $n\to\infty$, the orders of the $J_n$ values and the mutual information values asymptotically coincide, and from (6), the timing when the procedure terminates is asymptotically correct.
Comparing the Chow-Liu algorithms based on $I_n$ and $J_n$, which maximize the likelihood and the posterior probability, respectively, we find that

the algorithm based on $J_n$ does not necessarily generate a spanning tree but a forest, and

the edge set generated by $J_n$ is not necessarily a subset of the spanning tree generated by $I_n$.
III. Forest Learning from Incomplete Data
We consider an extension of the Chow-Liu algorithm based on (5) such that it addresses a data frame that contains missing values. Specifically, we construct quantities $J_n\{i,j\}$ and $J_n'\{i,j\}$ (equations (11) and (12) below) that generalize (5) to respectively obtain

a forest that maximizes the posterior probability given a data frame

a forest that converges to the true one as $n$ increases.
Given a data frame, for each pair $\{i,j\}$, we compute $J_n\{i,j\}$ and $J_n'\{i,j\}$ based on the samples such that none of the values of $X_i$ and $X_j$ are missing.
First, we arbitrarily choose a root for each connected component of the forest. Let
$$z^{m_i}=(z_1,\dots,z_{m_i})$$
and $Q^i(z^{m_i})$ be the sequence of nonmissing values of $X_i$ and the Bayes measure with respect to the root $X_i$, respectively. In particular, $X_i$ represents one of the $N$ variables, and $Q^i(z^{m_i})$ is obtained via
$$Q^i(z^{m_i})=\frac{\Gamma(\sum_x a_x)}{\prod_x\Gamma(a_x)}\cdot\frac{\prod_x\Gamma(c_x+a_x)}{\Gamma(m_i+\sum_x a_x)},$$
where $c_x$ is the number of occurrences of $x$ in $z^{m_i}$, and $x$ ranges over the values that $X_i$ takes. Note that $x^n$ with $n$ values is replaced by $z^{m_i}$ with $m_i\leq n$ values because some values are missing.
Because each connected component of the forest is a tree, a directed path from its root to each vertex in the component is unique, and a directed edge set $\vec{E}$ is determined from $E$. We arbitrarily fix a directed edge $(i,j)\in\vec{E}$ from the upper $X_i$ to the lower $X_j$, and let $(x^n,y^n)$ be the columns of $X_i$ and $X_j$, in which some values may be missing. Moreover, let $Q^{ij}$, $Q^i$, and $Q^j$ be the Bayes measures with respect to $(X_i,X_j)$, $X_i$, and $X_j$, respectively, each computed from the corresponding nonmissing subsequence.

Suppose that we have a data frame consisting of $n$ columns in which the values of the two variables are $x^n$ and $y^n$, where "*" denotes a missing value. We write $(x',y')$ for the $m_{ij}$ pairs in which both values are present, $x''$ for the values of $X_i$ whose counterparts in $X_j$ are missing, and $y''$ for the values of $X_j$ whose counterparts in $X_i$ are missing, so that the scores $Q^{ij}(x',y')$, $Q^i(x'')$, and $Q^j(y'')$ are associated with the corresponding subsequences.

Since $Q^i(x'')$ and $Q^j(y'')$ depend on $x''$ and $y''$, respectively, rather than on the pair, the Bayes measure with respect to $(X_i,X_j)$ can be evaluated as
$$Q^{ij}(x',y')\cdot Q^i(x'')\cdot Q^j(y'').$$
This means that the Bayes measure with respect to the whole forest is evaluated as
$$Q_n(x_1^n,\dots,x_N^n\mid E)=\prod_{i\in V}Q^i(z_i^{m_i})\prod_{\{i,j\}\in E}\frac{Q^{ij}(x',y')}{Q^i(x')Q^j(y')},\qquad(10)$$
where $z_i^{m_i}$ is the nonmissing subsequence of the $i$th column. Note that $\prod_{i\in V}Q^i(z_i^{m_i})$ does not depend on the edge set $E$ and that (8) is a special case of (10).

Since maximizing (10) is equivalent to maximizing the product of the ratios $Q^{ij}(x',y')/(Q^i(x')Q^j(y'))$ over $\{i,j\}\in E$, to maximize the posterior probability when the prior distribution over the forests is uniform, it is sufficient to sequentially choose the pair $\{i,j\}$ that maximizes
$$J_n\{i,j\}:=\frac{1}{n}\log\frac{Q^{ij}(x',y')}{Q^i(x')Q^j(y')}\qquad(11)$$
among the pairs, based on the samples such that none of the values of $X_i$ and $X_j$ are missing. We find that (11) contains (5) as a special case in which no value is missing in the samples for all the pairs. We summarize the above discussion as follows:
Theorem 1
If we apply $J_n\{i,j\}$ in (11) to the Chow-Liu algorithm, then we obtain a forest with the maximum posterior probability.
Proof of Theorem 1: Maximizing the Bayes measure (10) is equivalent to maximizing (11) at each step of Kruskal’s algorithm. Since an optimal solution results even with a greedy choice of such pairs, we obtain a forest with the maximum posterior probability.
We find that the value of (11) coincides with the mutual information estimator
$$J_n'\{i,j\}:=\frac{1}{m_{ij}}\log\frac{Q^{ij}(x',y')}{Q^i(x')Q^j(y')},\qquad(12)$$
computed from the samples in which neither value is missing, multiplied by the nonmissing ratio $m_{ij}/n$, where $m_{ij}$ is the number of nonmissing samples for the pair $\{i,j\}$, i.e., the cardinality of the set of samples in which none of the two values is missing. Therefore, under $m_{ij}/n<1$ for some $\{i,j\}$, it is possible that maximizing the mutual information estimator (12) does not necessarily mean maximizing (11). Then, we have the following propositions:
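The relation between (11) and (12) can be sketched as follows (the notation, the Dirichlet(1/2, ..., 1/2) prior choice, and the approximation of the number of values of each variable by the number of values actually observed are ours): both scores use only the samples in which neither value is missing, and they differ only by the factor $m_{ij}/n$.

```python
import math

def log_bayes_measure(seq, a=0.5):
    """Log of a Dirichlet(a, ..., a) Bayes measure of a discrete sequence.
    Assumption: the alphabet is taken to be the set of values seen in seq."""
    counts = {}
    for s in seq:
        counts[s] = counts.get(s, 0) + 1
    k = len(counts)
    m = len(seq)
    return (math.lgamma(k * a) - k * math.lgamma(a)
            + sum(math.lgamma(c + a) for c in counts.values())
            - math.lgamma(m + k * a))

def pair_scores(xs, ys):
    """Return the (11)-style and (12)-style scores for one pair,
    where None marks a missing value."""
    n = len(xs)
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    m = len(pairs)
    log_ratio = (log_bayes_measure(pairs)
                 - log_bayes_measure([p[0] for p in pairs])
                 - log_bayes_measure([p[1] for p in pairs]))
    return log_ratio / n, log_ratio / m   # divide by n for (11), by m for (12)

xs = [0, 1, None, 0, 1, 1, None, 0]
ys = [0, 1, 0, None, 1, 1, 0, 0]
j_post, j_cons = pair_scores(xs, ys)
print(j_post, j_cons)  # here (11)-style = (5/8) * (12)-style since m = 5, n = 8
```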
Proposition 3
Suppose that $m_{ij}/n\to\rho_{ij}>0$ as $n\to\infty$ with probability one for each $\{i,j\}$ ($i\neq j$). Then,
$$J_n'\{i,j\}\to I(X_i,X_j)\qquad(13)$$
as $n\to\infty$ with probability one.
Proposition 4
Suppose that $m_{ij}/n\to\rho_{ij}>0$ as $n\to\infty$ with probability one for each $\{i,j\}$ ($i\neq j$). Then, for each pair with $X_i\perp\!\!\!\perp X_j$,
$$J_n'\{i,j\}<0\qquad(14)$$
as $n\to\infty$ with probability one.
Proposition 5
Suppose that $m_{ij}/n\to\rho_{ij}>0$ as $n\to\infty$ with probability one for each $\{i,j\}$ ($i\neq j$). Then, if we apply $J_n'\{i,j\}$ to the Chow-Liu algorithm, the generated forest is the true one as $n\to\infty$ with probability one.
Proofs of Propositions 3, 4, and 5: If no missing value exists in the original samples, from the definitions of $J_n\{i,j\}$ and $J_n'\{i,j\}$ and the consistency shown in Section II, we have
$$J_n\{i,j\}=J_n'\{i,j\}\to I(X_i,X_j)$$
as $n\to\infty$ with probability one. The propositions consider an extended case in which some values may be missing. Since the occurrence is independent and identically distributed, $J_n'\{i,j\}$ evaluates independence based on the $m_{ij}$ nonmissing samples, where $m_{ij}/n\to\rho_{ij}>0$ implies $m_{ij}\to\infty$, such that
$$J_n'\{i,j\}\to I(X_i,X_j)$$
as $n\to\infty$ with probability one, which proves Proposition 3. Meanwhile, since $J_n'\{i,j\}$ is the estimator (5) applied to the $m_{ij}$ complete samples, the decision (6) is correct for large $m_{ij}$; in particular, $J_n'\{i,j\}<0$ for $X_i\perp\!\!\!\perp X_j$, which proves Proposition 4. Proposition 5 follows because the orders of the $J_n'$ values and the mutual information values asymptotically coincide (Proposition 3) and the timing when the Chow-Liu algorithm terminates is asymptotically correct (Proposition 4).
Finally, we show that Propositions 3 and 5 do not hold for $J_n\{i,j\}$, which means that in model selection with respect to incomplete data, the model that maximizes the posterior probability may not be asymptotically correct. As an extreme case, if all of the values are missing for $X_3$ among $X_1$, $X_2$, and $X_3$, then even if the values of $I(X_1,X_3)$ and $I(X_2,X_3)$ are large, only $\{1,2\}$ will be connected:
Proposition 6
Maximizing the posterior probability does not imply asymptotic consistency when selecting models with incomplete data.
We prove the proposition by constructing such a case. In the following example, the values of $X_2$ are not missing with a positive probability:
Example 2
We assume that $N=3$, that $X_1\to X_2\to X_3$ forms a Markov chain with $0<I(X_1,X_3)<I(X_2,X_3)\leq I(X_1,X_2)$, and that $X_1$ and $X_3$ are conditionally independent given $X_2$. Then, the true forest should be $\{\{1,2\},\{2,3\}\}$ because the distribution is $P(x_1)P(x_2|x_1)P(x_3|x_2)$. We also find that
$$J_n'\{1,2\}\to I(X_1,X_2)$$
and
$$J_n'\{2,3\}\to I(X_2,X_3),$$
with both limits exceeding $I(X_1,X_3)$, so that the Chow-Liu algorithm based on $J_n'$ obtains the true forest for large $n$. However, if we further assume that $x_2$ is missing with probability
$$\delta>1-\frac{I(X_1,X_3)}{I(X_2,X_3)}\qquad(15)$$
and that no values of $x_1$ and $x_3$ are missing, then we find that $J_n$ asymptotically chooses an incorrect forest because
$$J_n\{2,3\}\to(1-\delta)I(X_2,X_3),\quad J_n\{1,3\}\to I(X_1,X_3),$$
and (15) is equivalent to $(1-\delta)I(X_2,X_3)<I(X_1,X_3)$.
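The mechanism can be illustrated numerically (the mutual information values and the missing probability below are ours, not the paper's): the limits of the (11)-based weights are $\rho_{ij}I(X_i,X_j)$, so discounting the edges that touch the partially missing variable can reorder Kruskal's choices.

```python
# Limiting (11)-based weights rho_ij * I(X_i, X_j) for three variables,
# where x_2 is missing with probability delta (illustrative values).
I = {frozenset({1, 2}): 0.60, frozenset({2, 3}): 0.50, frozenset({1, 3}): 0.40}
delta = 0.3
rho = {e: (1 - delta if 2 in e else 1.0) for e in I}

weights = {e: rho[e] * I[e] for e in I}
order = sorted(weights, key=weights.get, reverse=True)
# Kruskal now picks {1,2} and then {1,3}; {2,3} would create a loop,
# so the generated tree differs from the true forest {{1,2}, {2,3}}.
print([sorted(e) for e in order])  # [[1, 2], [1, 3], [2, 3]]
```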
Both Chow-Liu algorithms based on $J_n$ and $J_n'$ complete in $O(N^2n)$ time.
Example 3 (Experiments)
For complete data, we used the CRAN package BNSL that was developed by Joe Suzuki and Jun Kawahara [15]. The R package consists of functions that were written using Rcpp [6] and run 50-100 times faster than the usual R functions. For the experiments, we use the data sets Alarm [1] and Insurance [3], which are often used as benchmarks for Bayesian network structure learning.
library(BNSL)
mm = mi_matrix(alarm)
edge.list = kruskal(mm)
g = graph_from_edgelist(edge.list, directed=FALSE)
plot(g, vertex.size=1)
Before execution, the BNSL package should be installed via
install.packages("BNSL")
The functions mi_matrix and kruskal obtain the mutual information matrix and the edge list derived from it by Kruskal's algorithm, respectively, and the last two lines output the graph using the function plot in the igraph library.
For incomplete data, however, we constructed modified functions that realize $J_n$ and $J_n'$ for the experiments.
Figure 4 depicts the forests for the complete data with respect to Alarm and Insurance, which contain samples of 37 and 27 variables, respectively (consult references [1] and [3] for the meanings of the variables numbered 1-37 and 1-27), using the functions in the BNSL package. The forests were generated in a few seconds.
Then, we generated forests with respect to Alarm for the first $n$ samples, where the values of the first ten variables out of 37 were made missing with a fixed probability. We addressed those data frames using $J_n$ and $J_n'$. Table I shows the entropy of the generated forests (the data set is random due to the noise, and the resulting forest will be random). We can consider that the less the entropy, the more stable the estimation, and we observe that the estimation via $J_n'$ is less stable compared with the estimation via $J_n$ for small $n$ and a large missing probability. This appears to be because the sample size decreases from $n$ to $m_{ij}$ on average for the first 10 variables, but the estimation of $J_n$ is multiplied by $m_{ij}/n$ on average even when it is based on the small sample size $m_{ij}$; thus, the estimation variance of $J_n'$ is rather large. However, for large $n$ and a small missing probability, the estimation via $J_n'$ is more correct than the one via $J_n$. We expected that for large $n$, the consistent estimation via $J_n'$ is closer to the true forest than that via $J_n$. To examine this expectation, we generated forests using $J_n$ and $J_n'$ for a large $n$ and a small missing probability, and the entropy values were 1.798 and 1.195, respectively. The generated forests are as in Figure 5 and Table II. The function based on $J_n'$ chooses the correct edges more often than that based on $J_n$. We observe that the two forests are almost close except for the edges in one subgraph. Because we added noise to the first 10 variables, it is likely that the mutual information estimation between the eighth and ninth variables was underestimated even when $n$ is large. The function based on $J_n'$ chose those edges for 97.5% of the data sets, while that based on $J_n$ chose them for 87% of the data sets.
TABLE I

   n    J_n     J_n'   |    n    J_n     J_n'   |    n    J_n     J_n'
 100   5.769   6.392   |  200   5.569   6.369   |  200   7.164   7.526
 200   3.699   4.349   |  500   3.419   3.823   |  500   5.751   6.914
 500   1.963   1.995   | 1000   2.552   2.513   | 1000   4.834   5.084
1000   0.941   0.840   | 2000   2.117   2.065   | 5000   1.366   1.297
IV. Universal Coding of Incomplete Data
In this section, we consider encoding a data frame.
We claim that the entropy when no value is missing is given by
$$\sum_{i\in V}H(X_i)-\sum_{\{i,j\}\in E^*}I(X_i,X_j),\qquad(16)$$
where $E^*$ is the edge set obtained by applying the Chow-Liu algorithm to the set of mutual information values $\{I(X_i,X_j)\}_{i\neq j}$. In fact, (16) is the entropy $H(X_1,\dots,X_N)$ of the forest distribution (1), where