# Forest Learning Universal Coding

This paper considers structure learning from data with n samples of p variables, assuming that the structure is a forest, using the Chow-Liu algorithm. Specifically, for incomplete data, we construct two model selection algorithms that complete in O(p^2) steps: one obtains a forest with the maximum posterior probability given the data, and the other obtains a forest that converges to the true one as n increases. We show that the two forests are generally different when some values are missing. Additionally, we present estimations for benchmark data sets to demonstrate that both algorithms work in realistic situations. Moreover, we derive the conditional entropy provided that no value is missing, and we evaluate the per-sample expected redundancy for the universal coding of incomplete data in terms of the number of non-missing samples.

08/01/2018


## I Introduction

Graphical models have a wide range of applications in various fields, such as signal processing, coding theory, and bioinformatics [9]. Inferring the graphical model from data is a starting point in such applications.

In this paper, we learn a probabilistic relation among variables from a data frame that consists of samples

$$\left.\begin{matrix}X^{(1)}=x_{1,1} & \cdots & X^{(p)}=x_{1,p}\\ \vdots & & \vdots\\ X^{(1)}=x_{n,1} & \cdots & X^{(p)}=x_{n,p}\end{matrix}\right\}\ n\times p\ .$$

The conditional independence relations can generally be expressed by a Bayesian network. However, estimating the optimal structure given the data frame is difficult because more than an exponential number of directed acyclic graphs with respect to $p$ are candidates. In this paper, we assume that the underlying model is a forest rather than a general Bayesian network.

The forest learning problem has a long history and has been investigated by several authors. The basic method was considered by Chow and Liu [4]: connect pairs of variables (vertices) as edges, as long as no loop is generated by the connection, in descending order of mutual information to obtain a spanning tree. The algorithm assumes that the mutual information values are known, and the resulting tree expresses an approximation of the original distribution. However, we may start from a data frame and connect the edges based on estimates of the mutual information values. Recently, Liu et al. [10] used kernel density estimation to prove consistency for continuous variables in a high-dimensional setting, and Tan et al. [16] restricted the number of edges to prove consistency for discrete variables in a different high-dimensional setting.

The approach that we take in this paper is essentially different. We note that adding an edge to a forest without creating any loop increases the complexity of the forest while improving the likelihood of the distribution that the forest expresses. In this sense, a balance between the two should be considered to obtain a correct forest. In 1993 [12], the author considered a modified estimator of mutual information that takes this balance into account and applied it to the Chow-Liu algorithm. The resulting undirected graph is not a tree but a forest, and the estimator satisfies consistency, avoiding overfitting because it minimizes the description length rather than maximizing the likelihood. Our estimation in this paper is Bayesian and slightly different from that of [12], although the two are essentially equivalent. Specifically, we find a model that maximizes the posterior probability given the data frame when some of the values are missing [11, 8].

In general, it is computationally difficult to obtain the Bayes optimal solution in model selection with incomplete data. In fact, suppose that the $p$ variables are binary and that $m$ of the values are missing. Then, to obtain the Bayes score of each candidate model, where the score is defined by the prior probability of a model multiplied by the conditional probability of the data frame given the model, we need to sum over $2^m$ completions of the data frame, because the $m$ missing values should be marginalized.

An alternative approach that both reduces the computational effort and obtains a correct model as the sample size increases is to select a model based on the samples in which all the values are available. This method ensures consistency, i.e., a correct model is obtained for large $n$ if the number of such samples also becomes large and if an appropriate model selection method is applied to those samples. However, this method excludes the samples in which at least one value is missing, and it eventually fails to obtain the Bayes optimal model. In this paper, we assume that the underlying model is a forest rather than a general graphical model to solve such problems.

The remainder of this paper is organized as follows. In Section 2, assuming that no value is missing in a given data frame, we consider a Bayes optimal mutual information estimator that avoids overfitting. In Section 3, we construct a model selection procedure that maximizes the posterior probability given a data frame that may contain missing values. The computation takes at most $O(p^2)$ steps for $p$ variables. A surprising result is that the model that maximizes the posterior probability does not necessarily converge to a correct model as $n$ increases for an incomplete data frame. Moreover, we illustrate the theoretical results by showing experiments using the Alarm [1] and Insurance [3] data sets as benchmarks. In Section 4, we evaluate the code length of each data frame and the expected redundancy per sample when some values may be missing, where redundancy is defined as the difference between the expected compression ratio and the entropy for a given pair of coding and source. Section 5 summarizes the results and presents directions for future work.

## II Forest Learning from Complete Data

We assume that each random variable takes a finite number of values. By a forest, we mean an undirected graph without any loops. If the vertex and edge sets are given by $V=\{1,\cdots,p\}$ and a subset $E$ of $\{\{i,j\}\mid i,j\in V,\ i\neq j\}$, respectively, then we have a distribution in the form

$$P'_X(X^{(1)},\cdots,X^{(p)}):=\prod_{i\in V}P(X^{(i)})\prod_{\{i,j\}\in E}\frac{P(X^{(i)},X^{(j)})}{P(X^{(i)})P(X^{(j)})}\ . \tag{1}$$

Moreover, if we specify the probabilities $P(X^{(i)})$, $i\in V$, and $P(X^{(i)},X^{(j)})$, $\{i,j\}\in E$, then the distribution (1) is uniquely determined. For example, although the distributions

$$P(X^{(2)})P(X^{(1)}|X^{(2)})P(X^{(3)}|X^{(2)})P(X^{(5)}|X^{(2)})\cdot P(X^{(4)}|X^{(3)})P(X^{(6)})P(X^{(7)}|X^{(6)})$$

and

$$P(X^{(4)})P(X^{(3)}|X^{(4)})P(X^{(2)}|X^{(3)})P(X^{(5)}|X^{(2)})\cdot P(X^{(1)}|X^{(2)})P(X^{(7)})P(X^{(6)}|X^{(7)})$$

may be expressed by directed acyclic graphs as in Figure 1 (a) and Figure 1 (b), respectively, both can be expressed by the distribution

$$P(X^{(1)})P(X^{(2)})P(X^{(3)})P(X^{(4)})P(X^{(5)})P(X^{(6)})\cdot P(X^{(7)})\cdot\frac{P(X^{(1)},X^{(2)})}{P(X^{(1)})P(X^{(2)})}\cdot\frac{P(X^{(2)},X^{(3)})}{P(X^{(2)})P(X^{(3)})}\cdot\frac{P(X^{(2)},X^{(5)})}{P(X^{(2)})P(X^{(5)})}\cdot\frac{P(X^{(3)},X^{(4)})}{P(X^{(3)})P(X^{(4)})}\cdot\frac{P(X^{(6)},X^{(7)})}{P(X^{(6)})P(X^{(7)})}$$

and the undirected graph in Figure 1 (c), where the vertices $1,\cdots,7$ and the edges $\{1,2\}$, $\{2,3\}$, $\{2,5\}$, $\{3,4\}$, and $\{6,7\}$ correspond to $V$ and $E$, respectively.

First, suppose that the distribution $P_X$ of $(X^{(1)},\cdots,X^{(p)})$ is known. We consider minimizing the Kullback-Leibler divergence $D(P_X\|P'_X)$ incurred by approximating the true distribution $P_X$ by a distribution $P'_X$ in the form (1):

$$D(P_X\|P'_X)=-H(1,\cdots,p)+\sum_{i\in V}H(i)-\sum_{\{i,j\}\in E}I(i,j)\ , \tag{2}$$

where $H(1,\cdots,p)$, $H(i)$, and $I(i,j)$ are the entropy of $(X^{(1)},\cdots,X^{(p)})$, the entropy of $X^{(i)}$, and the mutual information between $X^{(i)}$ and $X^{(j)}$ ($i\neq j$), respectively. To minimize $D(P_X\|P'_X)$, because the first two terms in (2) are constants that do not depend on $E$, we find that maximizing the mutual information sum $\sum_{\{i,j\}\in E}I(i,j)$ is sufficient. For this purpose, we apply Kruskal's algorithm that, given symmetric non-negative weights $w(i,j)$, $i\neq j$, obtains a spanning tree (a connected forest) such that the weight sum is maximized: let the edge set $E$ be the empty set and the candidate set $U$ be the set of all pairs $\{i,j\}$ at the beginning, and continue to

1. add the pair $\{i,j\}$ with the largest $w(i,j)$ among $U$ to $E$ if connecting them does not cause any loop to be generated, and

2. remove the pair $\{i,j\}$ from $U$ (irrespective of whether the pair is connected)

until $U$ is empty. The Chow-Liu algorithm (1968) [4] uses the mutual information $I(i,j)$ as the weight to minimize $D(P_X\|P'_X)$. For example, in Figure 2, if $I(1,2)\geq I(1,3)\geq I(2,3)\geq I(3,4)\geq\cdots$, then $\{1,2\}$ and $\{1,3\}$ are connected at the beginning, but $\{2,3\}$ will not be connected because connecting 2 and 3 causes a loop to be generated although the pair has the third largest mutual information value. Furthermore, $\{3,4\}$ is to be connected because it has the fourth largest mutual information value and no loop will be generated. The procedure terminates at this point because a loop would be generated if any additional pair of vertices were connected.
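The edge-selection loop above can be sketched as follows (the paper's experiments use R, so all names here are illustrative Python; the positivity check anticipates the Bayesian variant discussed later in this section, where a non-positive weight indicates independence):

```python
def kruskal_forest(p, weights):
    """Maximum-weight spanning forest via Kruskal's algorithm.

    weights maps frozenset({i, j}) -> w(i, j) for vertices 0, ..., p - 1.
    Pairs are examined in descending order of weight; a pair is connected
    only if doing so generates no loop (checked with union-find)."""
    parent = list(range(p))

    def find(v):
        # Follow parent pointers to the representative of v's component.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    E = []
    for pair in sorted(weights, key=weights.get, reverse=True):
        if weights[pair] <= 0:   # remaining weights are non-positive: stop
            break
        i, j = tuple(pair)
        ri, rj = find(i), find(j)
        if ri != rj:             # connecting i and j generates no loop
            parent[ri] = rj
            E.append(pair)
    return E
```

With mutual information values as weights, all of which are non-negative, this reduces to the Chow-Liu procedure and returns a spanning tree of each connected component.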

Next, we consider the case in which no distribution but only a data frame is given. In this section, we assume that no value is missing in the data frame.

A naive approach to generate a forest is to estimate a mutual information value by the quantity

$$I_n(i,j):=\sum_x\sum_y\frac{c(x,y)}{n}\log\frac{c(x,y)/n}{(c(x)/n)\cdot(c(y)/n)} \tag{3}$$

from the occurrences $c(x,y)$, $c(x)$, and $c(y)$ of $(X^{(i)},X^{(j)})=(x,y)$, $X^{(i)}=x$, and $X^{(j)}=y$ in the $n$ samples and to plug $I_n(i,j)$ into the Chow-Liu algorithm. Although $I_n(i,j)$ converges to $I(i,j)$ as $n$ increases, $I_n(i,j)$ is always positive; thus, the Chow-Liu algorithm always generates a spanning tree. For example, two variables are to be connected for finite $n$ even if they are independent. This is because the maximum likelihood estimator may overfit, and in such cases, eventually no consistency is obtained.
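A minimal sketch of the plug-in estimator (3), in illustrative Python (log base 2; `empirical_mi` is not part of any package used in the paper):

```python
from collections import Counter
from math import log2

def empirical_mi(xs, ys):
    """Plug-in estimate of eq. (3) from paired samples, in bits."""
    n = len(xs)
    cxy = Counter(zip(xs, ys))            # joint occurrence counts c(x, y)
    cx, cy = Counter(xs), Counter(ys)     # marginal counts c(x), c(y)
    return sum(c / n * log2((c / n) / ((cx[x] / n) * (cy[y] / n)))
               for (x, y), c in cxy.items())
```

Even for samples drawn from independent variables, the estimate is typically strictly positive, which is the overfitting discussed above.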

In 1993, Suzuki [12] proposed replacing the quantity $I_n(i,j)$ by another estimator of mutual information

$$I_n(i,j)-\frac{(\alpha(i)-1)(\alpha(j)-1)}{2n}\log n\ , \tag{4}$$

where $\alpha(i)$ and $\alpha(j)$ are the numbers of values that $X^{(i)}$ and $X^{(j)}$ take. Later, the same author [13] found that the value of (4) coincides with

$$J_n(i,j):=\frac{1}{n}\log\frac{Q_n(i,j)}{Q_n(i)Q_n(j)} \tag{5}$$

up to smaller-order terms, where the quantities $Q_n(i)$, $Q_n(j)$, and $Q_n(i,j)$, each termed a Bayes measure and defined later in this section, are computed from the $n$ samples with respect to $X^{(i)}$, $X^{(j)}$, and $(X^{(i)},X^{(j)})$ ($i\neq j$) and satisfy

$$0\leq Q_n(i),\ Q_n(j),\ Q_n(i,j)\leq1$$

and

$$\sum Q_n(i),\ \sum Q_n(j),\ \sum Q_n(i,j)\leq1$$

for each $n$, where the sums range over all possible length-$n$ data sequences.

This estimator is similar to $I_n(i,j)$ in the sense that $J_n(i,j)\to I(i,j)$ as $n\to\infty$, but it also has the property that it takes a negative value for large $n$ if and only if $X^{(i)}$ and $X^{(j)}$ are independent, which we write as $I(i,j)=0$. Moreover, if the prior probability of $I(i,j)=0$ is $0<q<1$, then deciding $I(i,j)=0$ if and only if

$$q\,Q_n(i)Q_n(j)\geq(1-q)\,Q_n(i,j)$$

maximizes the posterior probability of the decision, which is equivalent to

$$J_n(i,j)\leq0\Longleftrightarrow I(i,j)=0 \tag{6}$$

when $q=1/2$. We can show that the decision (6) is correct for large $n$, which can be stated as follows:

###### Proposition 1 (Suzuki [13])

The decision (6) is true with probability one for large $n$.

We shall now define a Bayes measure. For a random variable $X$ that takes the values zero and one, let $\theta$ be the probability of $X=1$. Then, the probability that an independent sequence with $c$ ones and $n-c$ zeros occurs can be written as $\theta^c(1-\theta)^{n-c}$. If a prior density $w(\theta)$ of $\theta$ is available, then by integrating over $0\leq\theta\leq1$, we can compute the measure of the sequence without assuming any specific $\theta$:

$$Q_n(X)=\int_0^1\theta^c(1-\theta)^{n-c}w(\theta)\,d\theta\ .$$

If the weight $w(\theta)$ is in the form

$$w(\theta)\propto\theta^{a-1}(1-\theta)^{b-1}$$

with $a,b>0$, then we obtain the Bayes measure as

$$Q_n(X)=\frac{\Gamma(a+b)}{\Gamma(n+a+b)}\cdot\frac{\Gamma(c+a)}{\Gamma(a)}\cdot\frac{\Gamma(n-c+b)}{\Gamma(b)}\ , \tag{7}$$

where $\Gamma(\cdot)$ is the Gamma function $\Gamma(z)=\int_0^\infty t^{z-1}e^{-t}\,dt$. For example, suppose that $n=5$ and $a=b=1/2$. Then, we have

$$Q_n(X)=\frac{1}{5!}\cdot\underbrace{\left(c-\frac{1}{2}\right)\left(c-\frac{3}{2}\right)\cdots\frac{1}{2}}_{c}\cdot\underbrace{\left(n-c-\frac{1}{2}\right)\left(n-c-\frac{3}{2}\right)\cdots\frac{1}{2}}_{5-c}=\begin{cases}63/2^8,&c=0,5\ ,\\7/2^8,&c=1,4\ ,\\3/2^8,&c=2,3\ .\end{cases}$$

We define the quantities $Q_n(i)$, $Q_n(j)$, and $Q_n(i,j)$ similarly to $Q_n(X)$, assuming that $X^{(i)}$, $X^{(j)}$, and $(X^{(i)},X^{(j)})$ take values in $\{0,\cdots,\alpha(i)-1\}$, $\{0,\cdots,\alpha(j)-1\}$, and $\{0,\cdots,\alpha(i)-1\}\times\{0,\cdots,\alpha(j)-1\}$, respectively; the computation for $Q_n(X)$ can be extended to those for $Q_n(i)$, $Q_n(j)$, and $Q_n(i,j)$, respectively. For example, for $X$ with $\alpha$ values, the constants $a,b$ and the occurrences $c,n-c$ are replaced by $a(x)$ and $c(x)$, respectively, for $x=0,\cdots,\alpha-1$; thus, the extended formula can be expressed by [7, 14]

$$Q_n(X)=\frac{\Gamma(\sum_x a(x))}{\Gamma(\sum_x(c(x)+a(x)))}\prod_{x=0}^{\alpha-1}\frac{\Gamma(c(x)+a(x))}{\Gamma(a(x))}\ .$$
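A minimal sketch of this computation, assuming a symmetric prior $a(x)=a$ for all $x$ and using the log-Gamma function for numerical stability (illustrative Python; with counts over two values, $n=5$, and $a=1/2$, it reproduces the values $63/2^8$, $7/2^8$, and $3/2^8$ above):

```python
from math import lgamma, exp

def bayes_measure(counts, a=0.5):
    """Bayes measure Q_n(X) with a symmetric Dirichlet(a, ..., a) prior.

    counts: list of occurrence counts c(x), one entry per value x."""
    A = a * len(counts)                    # sum of the hyperparameters a(x)
    n = sum(counts)
    logq = lgamma(A) - lgamma(n + A)       # Gamma(sum a) / Gamma(n + sum a)
    for c in counts:
        logq += lgamma(c + a) - lgamma(a)  # Gamma(c(x) + a) / Gamma(a)
    return exp(logq)
```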

For completeness, we show the proofs of the derivations from (5) to (4) and Proposition 1 in Appendices A and B, respectively.

###### Example 1 (Experiment)

We show box plots that depict the realizations of $I_n(i,j)$ and $J_n(i,j)$ when the mutual information values are zero and positive, and we find that $I_n(i,j)$ is always larger than $J_n(i,j)$, which is due to its overfitting (Figure 3). Specifically, $I_n(i,j)$ cannot detect independence because its value always exceeds zero.
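The contrast in Example 1 can be reproduced with a small sketch that computes $J_n(i,j)$ of (5) from the Bayes measures, assuming the symmetric prior $a=1/2$ (illustrative Python, natural logarithm):

```python
from collections import Counter
from math import lgamma

def log_bayes_measure(counts, a=0.5):
    """log Q_n for occurrence counts, symmetric Dirichlet(a, ..., a) prior."""
    A = a * len(counts)
    n = sum(counts)
    return (lgamma(A) - lgamma(n + A)
            + sum(lgamma(c + a) - lgamma(a) for c in counts))

def jn_estimate(xs, ys, alpha_i=2, alpha_j=2):
    """Estimator J_n(i, j) of eq. (5) for complete data; values are 0, 1, ..."""
    n = len(xs)
    ci, cj, cij = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    qi = log_bayes_measure([ci[x] for x in range(alpha_i)])
    qj = log_bayes_measure([cj[y] for y in range(alpha_j)])
    qij = log_bayes_measure([cij[(x, y)] for x in range(alpha_i)
                             for y in range(alpha_j)])
    return (qij - qi - qj) / n
```

Unlike the plug-in estimate (3), this quantity goes negative for samples that look independent.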

In contrast, Kruskal’s algorithm works even when some weights are either zero or negative: a pair is connected only if the weight is positive. If we apply mutual information values based on (5) rather than (3) to the Chow-Liu algorithm, then it is possible that the value of (5) is negative, which means that the two variables are independent:

###### Proposition 2

A pair of nodes need not be connected even when connecting them does not cause any loops to be generated.

Proof of Proposition 2: Kruskal’s algorithm connects as an edge only nodes with a positive weight.

Hence, the Chow-Liu algorithm based on (5) may terminate before causing overfitting. Moreover, via (1), sequentially choosing an edge with the maximum value of (5) is equivalent to choosing a forest with the maximum

$$R_n(E):=\prod_{i\in V}Q_n(i)\prod_{\{i,j\}\in E}\frac{Q_n(i,j)}{Q_n(i)Q_n(j)}\ . \tag{8}$$

In fact, the first term on the right-hand side of

$$-\frac{1}{n}\log R_n(E)=\sum_{i\in V}\left(-\frac{1}{n}\log Q_n(i)\right)-\sum_{\{i,j\}\in E}J_n(i,j) \tag{9}$$

is constant irrespective of $E$, and minimizing $-\frac{1}{n}\log R_n(E)$ and maximizing the sum of the $J_n(i,j)$ values over $\{i,j\}\in E$ are equivalent. Therefore, if we prepare the uniform prior over the forests, then the Chow-Liu algorithm based on (5) chooses a forest with the maximum posterior probability given the samples. If the prior probability is given by

$$P(E)=K\prod_{\{i,j\}\in E}\frac{1-q(i,j)}{q(i,j)}\ ,$$

where $0<q(i,j)<1$ and $K$ is a normalizing constant, then we can add $-\frac{1}{n}\log P(E)$ to (9) and replace (5) with [13]

$$J_n(i,j):=\frac{1}{n}\log\left\{\frac{1-q(i,j)}{q(i,j)}\cdot\frac{Q_n(i,j)}{Q_n(i)Q_n(j)}\right\}\ .$$

Moreover, consistency also holds. In fact, because $\frac{1}{n}\log\frac{1-q(i,j)}{q(i,j)}\to0$ as $n\to\infty$, the orderings of the pairs induced by the modified and the original $J_n(i,j)$ asymptotically coincide, and from (6), the timing at which the procedure terminates is asymptotically correct.

Comparing the Chow-Liu algorithms based on $I_n$ and $J_n$, which maximize the likelihood and the posterior probability, respectively, we find that

1. $J_n$ does not necessarily generate a spanning tree but a forest.

2. the edge set generated by $J_n$ is not necessarily a subset of the spanning tree generated by $I_n$.

## III Forest Learning from Incomplete Data

We consider an extension of the Chow-Liu algorithm based on (5) such that it addresses a data frame that contains missing values. Specifically, we construct quantities $J_n(i,j)$ and $K_n(i,j)$ that generalize (5) to respectively obtain

1. a forest that maximizes the posterior probability given a data frame

2. a forest that converges to the true one as $n$ increases.

Given a data frame, for each pair $\{i,j\}$, we compute $J_n(i,j)$ and $K_n(i,j)$ based on the samples in which neither the value of $X^{(i)}$ nor that of $X^{(j)}$ is missing.

First, we arbitrarily choose a root $r$ for each connected subgraph of the forest. Let

$$[r]:=\{k\in\{1,\cdots,n\}\mid x_{k,r}\text{ is not missing}\}\ ,$$

and let $Q_n(r)$ be the Bayes measure with respect to the root $r$ for the non-missing values $\{x_{k,r}\}_{k\in[r]}$. In particular, $X^{(r)}$ represents one of the variables, and $Q_n(r)$ is obtained via

$$Q_n(r)=\frac{\Gamma(\sum_x a(x))}{\Gamma(\sum_x(c^*(x)+a(x)))}\prod_x\frac{\Gamma(c^*(x)+a(x))}{\Gamma(a(x))}\ ,$$

where $c^*(x)$ is the number of occurrences of $X^{(r)}=x$ in $\{x_{k,r}\}_{k\in[r]}$, and $x$ ranges over the values that $X^{(r)}$ takes. Note that $c(x)$ with $\sum_x c(x)=n$ is replaced by $c^*(x)$ with $\sum_x c^*(x)\leq n$ because some values are missing.

Because each connected subgraph is a spanning tree, a directed path from its root to each vertex in the subgraph is unique, and a directed edge set $\vec{E}$ is determined from $E$. We arbitrarily fix a directed edge $(i,j)\in\vec{E}$ from the upper vertex $i$ to the lower vertex $j$, and let

$$[i]:=\{k\in\{1,\cdots,n\}\mid x_{k,i}\text{ is not missing}\}\ ,$$
$$[j]:=\{k\in\{1,\cdots,n\}\mid x_{k,j}\text{ is not missing}\}\ ,$$
$$[i,j]:=\{k\in\{1,\cdots,n\}\mid\text{neither }x_{k,i}\text{ nor }x_{k,j}\text{ is missing}\}\ .$$

Moreover, let $Q_n(i)$, $Q_n(j)$, and $Q_n(i,j)$ be the Bayes measures of $X^{(i)}$, $X^{(j)}$, and $(X^{(i)},X^{(j)})$ with respect to $[i]$, $[j]$, and $[i,j]$, respectively, and let $Q_n^j(i)$ and $Q_n^i(j)$ be the Bayes measures of $X^{(i)}$ and $X^{(j)}$ with respect to $[i,j]$.

Suppose that we have a data frame consisting of two columns in which some values of the two variables $X^{(i)}$ and $X^{(j)}$ are missing, where "*" denotes a missing value. Then, the index sets $[i]$, $[j]$, and $[i,j]$ select the subsequences from which the scores $Q_n(i)$, $Q_n(j)$, $Q_n(i,j)$, $Q_n^j(i)$, and $Q_n^i(j)$ are computed.
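A minimal sketch of how the index sets $[i]$, $[j]$, and $[i,j]$ might be computed from two columns, with "*" marking a missing value (illustrative Python, 0-based indices):

```python
def index_sets(col_i, col_j, missing="*"):
    """Return ([i], [j], [i, j]) for two columns of a data frame."""
    I = {k for k, v in enumerate(col_i) if v != missing}   # [i]
    J = {k for k, v in enumerate(col_j) if v != missing}   # [j]
    return I, J, I & J                                     # [i, j] = [i] ∩ [j]
```

For example, the columns (0, *, 1, 0) and (1, 1, *, 0) yield [i] = {0, 2, 3}, [j] = {0, 1, 3}, and [i, j] = {0, 3}.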

Since $Q_n(i)/Q_n^j(i)$ and $Q_n(j)/Q_n^i(j)$ depend on the samples in $[i]$ and $[j]$, respectively, rather than only on those in $[i,j]$, the Bayes measure with respect to the pair $\{i,j\}$ can be evaluated as

$$Q_n(i,j)\cdot\frac{Q_n(i)}{Q_n^j(i)}\cdot\frac{Q_n(j)}{Q_n^i(j)}=Q_n(i)\cdot\frac{Q_n(i,j)}{Q_n^j(i)}\cdot\frac{Q_n(j)}{Q_n^i(j)}\ .$$

This means that the Bayes measure with respect to the whole forest is evaluated as

$$R_n(E)=Q_n(r)\prod_{(i,j)\in\vec{E}}\left\{\frac{Q_n(i,j)}{Q_n^j(i)}\cdot\frac{Q_n(j)}{Q_n^i(j)}\right\}=\prod_{i\in V}Q_n(i)\prod_{\{i,j\}\in E}\frac{Q_n(i,j)}{Q_n^j(i)\,Q_n^i(j)}\ . \tag{10}$$

Note that $\prod_{i\in V}Q_n(i)$ does not depend on the edge set $E$ and that (8) is a special case of (10).

Since maximizing (10) is equivalent to maximizing the product of $\frac{Q_n(i,j)}{Q_n^j(i)Q_n^i(j)}$ over $\{i,j\}\in E$, to maximize the posterior probability when the prior distribution over the forests is uniform, it is sufficient to choose the edge set $E$ that maximizes

$$J_n(i,j):=\frac{1}{n}\log\frac{Q_n(i,j)}{Q_n^j(i)\,Q_n^i(j)} \tag{11}$$

among the pairs $\{i,j\}$ based on the samples in which neither the value of $X^{(i)}$ nor that of $X^{(j)}$ is missing. We find that (11) contains (5) as a special case in which no value is missing in the samples for all the pairs. We summarize the above discussion as follows:

###### Theorem 1

If we apply $J_n(i,j)$ in (11) to the Chow-Liu algorithm, then we obtain a forest with the maximum posterior probability.

Proof of Theorem 1: Maximizing the Bayes measure (10) is equivalent to maximizing (11) at each step of Kruskal’s algorithm. Since an optimal solution results even with a greedy choice of such pairs, we obtain a forest with the maximum posterior probability.

We find that the value of (11) coincides with the mutual information estimator

$$K_n(i,j)=\frac{1}{n(i,j)}\log\frac{Q_n(i,j)}{Q_n^j(i)\,Q_n^i(j)} \tag{12}$$

multiplied by the non-missing ratio $n(i,j)/n$, where $n(i,j)$ is the number of non-missing samples for the pair $\{i,j\}$, i.e., the cardinality of $[i,j]$. Therefore, under $n(i,j)<n$ for some $\{i,j\}$, it is possible that maximizing the mutual information estimator $K_n(i,j)$ does not necessarily mean maximizing $J_n(i,j)$. Then, we have the following propositions:
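The relation between (11) and (12) is just a rescaling by the non-missing ratio, which a tiny sketch makes explicit (illustrative names):

```python
def jn_and_kn(log_ratio, n, n_ij):
    """Given log{Q_n(i,j) / (Q_n^j(i) Q_n^i(j))} computed over the n_ij
    non-missing samples of a pair, return (J_n, K_n) of eqs. (11)-(12)."""
    return log_ratio / n, log_ratio / n_ij   # J_n = (n_ij / n) * K_n
```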

###### Proposition 3

Suppose that $n(i,j)\to\infty$ as $n\to\infty$ with probability one for each $\{i,j\}$ ($i\neq j$). Then,

$$K_n(i,j)\longrightarrow I(i,j) \tag{13}$$

as $n\to\infty$ with probability one.

###### Proposition 4

Suppose that $n(i,j)\to\infty$ as $n\to\infty$ with probability one for each $\{i,j\}$ ($i\neq j$). Then,

$$J_n(i,j)\leq0\Longleftrightarrow I(i,j)=0\Longleftrightarrow K_n(i,j)\leq0 \tag{14}$$

as $n\to\infty$ with probability one.

###### Proposition 5

Suppose that $n(i,j)\to\infty$ as $n\to\infty$ with probability one for each $\{i,j\}$ ($i\neq j$). Then, if we apply $K_n(i,j)$ to the Chow-Liu algorithm, the generated forest is the true one as $n\to\infty$ with probability one.

Proofs of Propositions 3, 4, and 5: If no value is missing in the original samples, from the definitions of $J_n(i,j)$ and $K_n(i,j)$ and Proposition 1, we have

$$J_n(i,j)=K_n(i,j)\leq0\Longleftrightarrow I(i,j)=0$$

as $n\to\infty$ with probability one. The propositions consider an extended case in which some values may be missing. Since the occurrences are independent and identically distributed, $K_n(i,j)$ evaluates independence based on the $n(i,j)$ non-missing samples with $n(i,j)\to\infty$, where $n(i,j)\leq n$, such that $K_n(i,j)\to I(i,j)$ and

$$K_n(i,j)\leq0\Longleftrightarrow I(i,j)=0$$

as $n\to\infty$ with probability one, which proves Proposition 3. Meanwhile, since $J_n(i,j)=\frac{n(i,j)}{n}K_n(i,j)$ with $\frac{n(i,j)}{n}>0$, we have

$$K_n(i,j)\leq0\Longleftrightarrow J_n(i,j)\leq0\ ,$$

which proves Proposition 4. Proposition 5 follows because the orders of $K_n(i,j)$ and $I(i,j)$ asymptotically coincide (Proposition 3) and the timing at which the Chow-Liu algorithm terminates is asymptotically correct (Proposition 4).

Finally, we show that Propositions 3 and 5 do not hold for $J_n(i,j)$, which means that in model selection with respect to incomplete data, the model that maximizes the posterior probability may not be asymptotically correct.

As an extreme case, if all of the values of $X^{(1)}$ are missing among the three variables $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$, then even if the values of $I(1,2)$ and $I(1,3)$ are large, only the pair $\{2,3\}$ will be connected:

###### Proposition 6

Maximizing the posterior probability does not imply asymptotic consistency when selecting models with incomplete data.

We prove the proposition by constructing such a case. In the following example, the values of $X^{(1)}$ are not missing with a positive probability:

###### Example 2

We assume that $X^{(1)}$, $X^{(2)}$, and $X^{(3)}$ are binary, that $X^{(1)}$ takes each of its two values with probability $1/2$, that $X^{(2)}$ and $X^{(3)}$ are each obtained by flipping $X^{(1)}$ with probability $0<\epsilon<1/2$, and that $X^{(2)}$ and $X^{(3)}$ are conditionally independent given $X^{(1)}$. Then, the true forest should be $E=\{\{1,2\},\{1,3\}\}$ because $I(1,2)=I(1,3)=1-H(\epsilon)$ is larger than $I(2,3)=1-H((1-\epsilon)^2+\epsilon^2)$, where $H(\cdot)$ is the binary entropy function. We also find that

$$K_n(1,2),\ K_n(1,3)\to1-H(\epsilon)$$

and

$$K_n(2,3)\to1-H((1-\epsilon)^2+\epsilon^2)\ .$$

However, if we further assume that each value of $X^{(1)}$ is missing with probability $\delta$ satisfying

$$\frac{H((1-\epsilon)^2+\epsilon^2)-H(\epsilon)}{1-H(\epsilon)}<\delta<1 \tag{15}$$

and that no values of $X^{(2)}$ and $X^{(3)}$ are missing, then we find that $J_n$ asymptotically chooses an incorrect forest because

$$J_n(1,2),\ J_n(1,3)\to(1-\delta)(1-H(\epsilon))\ ,$$
$$J_n(2,3)\to1-H((1-\epsilon)^2+\epsilon^2)\ ,$$

and (15) is equivalent to $(1-\delta)(1-H(\epsilon))<1-H((1-\epsilon)^2+\epsilon^2)$.
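The inequalities in Example 2 can be checked numerically; the sketch below verifies, for the illustrative value $\epsilon=0.1$, that any $\delta$ satisfying (15) makes the limit of $J_n(1,2)$ fall below that of $J_n(2,3)$, while the $K_n$ limits still rank the correct edges first:

```python
from math import log2

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

eps = 0.1                                        # illustrative flipping probability
agree = (1 - eps) ** 2 + eps ** 2                # P(X^(2) = X^(3))
threshold = (H(agree) - H(eps)) / (1 - H(eps))   # left end of (15)
delta = (threshold + 1) / 2                      # any delta in (threshold, 1)

assert (1 - delta) * (1 - H(eps)) < 1 - H(agree)  # J_n wrongly prefers {2,3}
assert 1 - H(eps) > 1 - H(agree)                  # K_n ranks {1,2},{1,3} first
```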

Both Chow-Liu algorithms, based on $J_n$ and on $K_n$, complete in $O(p^2)$ steps.

###### Example 3 (Experiments)

For complete data, we used the CRAN package BNSL that was developed by Joe Suzuki and Jun Kawahara [15]. The R package consists of functions that were written using Rcpp [6] and run 50-100 times faster than the usual R functions. For the experiments, we use the data sets Alarm [1] and Insurance [3], which are used often as benchmarks for Bayesian network structure learning.

```r
library(BNSL)
library(igraph)  # provides graph_from_edgelist and plot
mm <- mi_matrix(alarm)
edge.list <- kruskal(mm)
g <- graph_from_edgelist(edge.list, directed = FALSE)
plot(g, vertex.size = 1)
```


Before execution, the BNSL package should be installed via

```r
install.packages("BNSL")
```


The functions mi_matrix and kruskal obtain the mutual information matrix and the edge list computed by Kruskal's algorithm, respectively, and the last two lines output the graph using the function plot in the igraph library. For incomplete data, however, we constructed modified functions that realize $J_n$ and $K_n$ for the experiments.

Figure 4 depicts the forests for the complete data with respect to Alarm and Insurance, which contain samples for 37 and 27 variables, respectively (consult references [1] and [3] for the meanings of the variables numbered 1-37 and 1-27), using the functions in the BNSL package. The forests were generated in a few seconds.

Then, we generated forests with respect to Alarm for two sample sizes $n$, where the values of the first ten variables out of 37 were made missing with probability $\delta$. We addressed those data frames using $J_n$ and $K_n$. Table I shows the entropy of the generated forests (the data set is random due to the noise, and the resulting forest will be random). We can consider that the smaller the entropy, the more stable the estimation, and we observe that the estimation via $K_n$ is less stable compared with the estimation via $J_n$ for small $n$ and large $\delta$, which appears to be because the sample size decreases from $n$ to $(1-\delta)n$ on average for the first ten variables, whereas the estimation of $J_n$ is multiplied by $1-\delta$ on average even when it is based on the small sample size; thus, the estimation variance of $K_n$ is rather large. However, for large $n$ and small $\delta$, the estimation via $K_n$ is more correct than the one via $J_n$.

We expected that for large $n$, the consistent estimation via $K_n$ is closer to the true forest than that via $J_n$. To examine this expectation, we generated forests using $J_n$ and $K_n$, and the entropy values were 1.798 and 1.195, respectively. The generated forests are as in Figure 5 and Table II. The estimator $K_n$ chooses the correct edges more often than $J_n$. We observe that the two forests are almost identical except for the edges in one subgraph. Because we added noise to the first ten variables, it is likely that the mutual information between the eighth and ninth variables was underestimated even when $n$ is large. The estimator $K_n$ chose the corresponding edges for 97.5% of the data sets, while $J_n$ chose them for 87% of the data sets.

## IV Universal Coding of Incomplete Data

In this section, we consider encoding a data frame.

We claim that the entropy when no value is missing is given by

$$H(X):=\sum_{i\in V}H(i)-\sum_{\{i,j\}\in E_X}I(i,j)\ , \tag{16}$$

where $E_X$ is the edge set obtained by applying the Chow-Liu algorithm to the set $\{I(i,j)\}_{i\neq j}$. In fact, (16) is , where