# Scaling of Model Approximation Errors and Expected Entropy Distances

We compute the expected value of the Kullback-Leibler divergence to various fundamental statistical models with respect to canonical priors on the probability simplex. We obtain closed formulas for the expected model approximation errors, depending on the dimension of the models and the cardinalities of their sample spaces. For the uniform prior, the expected divergence from any model containing the uniform distribution is bounded by a constant 1-γ, and for the models that we consider, this bound is approached if the state space is very large and the models' dimension does not grow too fast. For Dirichlet priors the expected divergence is bounded in a similar way, if the concentration parameters take reasonable values. These results serve as reference values for more complicated statistical models.

## Authors


## 1 Introduction

Let $p$, $q$ be probability distributions on a finite set $\mathcal{X}$. The information divergence, relative entropy, or Kullback-Leibler divergence

$$D(p\|q) = \sum_{i\in\mathcal{X}} p_i \log\frac{p_i}{q_i}$$

is a natural measure of dissimilarity between $p$ and $q$. It specifies how easily the two distributions can be distinguished from each other by means of statistical experiments. In this paper we use the natural logarithm. The divergence is related to the log-likelihood: If $p$ is an empirical distribution, summarizing the outcome of $n$ statistical experiments, then the log-likelihood of a distribution $q$ equals $-n\,(D(p\|q) + H(p))$, where $H(p)$ is the Shannon entropy of $p$. Hence finding a maximum likelihood estimate $q$ within a set $\mathcal{M}$ of probability distributions is the same as finding a minimizer of the divergence $D(p\|q)$ with $q$ restricted to $\mathcal{M}$.
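The relation between log-likelihood and divergence can be checked numerically. The following is a minimal sketch and not part of the paper: the sample list, the candidate distribution `q`, and all function names are illustrative assumptions, and `n` denotes the number of observations.

```python
import math
from collections import Counter

def kl_divergence(p, q):
    """D(p||q) with natural logarithm; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shannon_entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Hypothetical outcomes of n experiments on the state space {0, 1, 2}.
samples = [0, 0, 1, 2, 0, 1, 0, 2, 2, 0]
n = len(samples)
counts = Counter(samples)
p_hat = [counts[i] / n for i in range(3)]   # empirical distribution

q = [0.5, 0.3, 0.2]                          # candidate distribution (assumed)

log_likelihood = sum(math.log(q[x]) for x in samples)
identity_rhs = -n * (kl_divergence(p_hat, q) + shannon_entropy(p_hat))

print(log_likelihood, identity_rhs)          # the two values agree
```

Since $H(\hat p)$ does not depend on $q$, maximizing the log-likelihood over $q$ and minimizing $D(\hat p\|q)$ over $q$ select the same distribution.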

Assume that is a set of probability distributions for which there is no simple mathematical description available. We would like to identify a model which does not necessarily contain all distributions from , but which approximates them relatively well. What error magnitude should we accept from a good model?

To assess the expressive power of a model $\mathcal{M}$, we study the function $p \mapsto D(p\|\mathcal{M}) = \inf_{q\in\mathcal{M}} D(p\|q)$. Finding the maximizers of this function corresponds to a worst-case analysis. The problem of maximizing the divergence from a statistical model was first posed in , motivated by infomax principles in the context of neural networks. Since then, important progress has been made, especially in the case of exponential families [5, 4, 8], but also in the case of discrete mixture models and restricted Boltzmann machines .

In addition to the worst-case error bound, the expected performance and expected error are of interest. This leads to the mathematical problem of computing the expectation value

$$\langle D(p\|\mathcal{M})\rangle = \int_{\Delta} D(p\|\mathcal{M})\,\psi(p)\,\mathrm{d}p, \tag{1}$$

where $p$ is drawn from a prior probability density $\psi$ on the probability simplex $\Delta$. The correct prior depends on the concrete problem at hand and is often difficult to determine. We ask: Given conditions on the prior, how different is the worst case from the average case? To what extent can both errors be influenced by the choice of the model? We focus on the case of Dirichlet priors. It turns out that in most cases the worst-case error diverges as the number of elementary events $N$ tends to infinity, while the expected error remains bounded. Our analysis leads to integrals that have been considered in Bayesian function estimation in , and we can take advantage of the tools developed there.

Our first observation is that, if $\psi$ is the uniform prior, then the expected divergence from $p$ to the uniform distribution is a monotone function of the system size $N$ and converges to the constant $1-\gamma$ as $N\to\infty$, where $\gamma$ is the Euler-Mascheroni constant. Many natural statistical models contain the uniform distribution, and the expected divergence from such models is bounded by the same constant. On the other hand, when $p$ and $q$ are chosen uniformly at random, the expected divergence $\langle D(p\|q)\rangle$ is equal to $1 - 1/N$.

We show, for a class of models including independence models, partition models, mixtures of product distributions with disjoint supports , and decomposable hierarchical models, that the expected divergence actually has the same limit, $1-\gamma$, provided the dimension of the models remains small with respect to $N$ (the usual case in applications). For Dirichlet priors the results are similar (for reasonable choices of parameters). In contrast, when $\mathcal{M}$ is an exponential family, the maximum value of $D(\cdot\|\mathcal{M})$ is at least , see .

In Section 2 we define various model classes and collect basic properties of Dirichlet priors. Section 3 contains our main results: closed-form expressions for the expectation values of entropies and divergences. The results are discussed in Section 4. Proofs and calculations are deferred to Appendix A.

## 2 Preliminaries

### 2.1 Models from statistics and machine learning

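We consider random variables on a finite set of elementary events $\mathcal{X}$, $|\mathcal{X}| = N$. The set of probability distributions on $\mathcal{X}$ is the $(N-1)$-simplex $\Delta_{N-1}$. A model is a subset of $\Delta_{N-1}$. The support sets of a model $\mathcal{M}$ are the support sets of points in $\mathcal{M}$.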

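The $K$-mixture of a model $\mathcal{M}$ is the union of all convex combinations of any $K$ of its points: $\mathcal{M}^{K} := \big\{\sum_{i=1}^{K}\lambda_i p^{(i)} \,\big|\, \lambda_i \ge 0,\ \sum_{i}\lambda_i = 1,\ p^{(i)}\in\mathcal{M}\big\}$. Given a partition $\varrho = \{A_1,\dots,A_K\}$ of $\mathcal{X}$ into $K$ disjoint support sets of $\mathcal{M}$, the $K$-mixture of $\mathcal{M}$ with disjoint supports $\varrho$ is the subset of $\mathcal{M}^{K}$ defined by

$$\mathcal{M}^{\varrho} = \Big\{ \sum_{i=1}^{K} \lambda_i p^{(i)} \in \mathcal{M}^{K} \;\Big|\; p^{(i)} \in \mathcal{M},\ \operatorname{supp}(p^{(i)}) \subseteq A_i \text{ for all } i \Big\}.$$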

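Let $\varrho = \{A_1,\dots,A_K\}$ be a partition of $\mathcal{X}$. The partition model $\mathcal{M}_{\varrho}$ consists of all $p \in \Delta_{N-1}$ that satisfy $p_i = p_j$ whenever $i, j$ belong to the same block in the partition $\varrho$. Partition models are closures of convex exponential families with uniform reference measures. The closure of a convex exponential family is a set of the form (see )

$$\mathcal{M}_{\varrho,\nu} = \Big\{ \sum_{k=1}^{K} \lambda_k \frac{\mathbb{1}_{A_k}\,\nu}{\nu(A_k)} \;\Big|\; \lambda_k \ge 0,\ \sum_{k=1}^{K} \lambda_k = 1 \Big\},$$

where $\nu$ is a positive function on $\mathcal{X}$ called reference measure, and $\mathbb{1}_{A_k}$ is the indicator function of $A_k$. Note that all measures $\nu$ with fixed conditional distributions $\nu(\cdot|A_k)$ on $A_k$, for all $k$, yield the same model. In fact, $\mathcal{M}_{\varrho,\nu}$ is the $K$-mixture of the set $\{\nu(\cdot|A_k) : k = 1,\dots,K\}$.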

For a composite system with variables , the set of elementary events is , for all . A product distribution is a distribution of the form

$$p(x_1,\dots,x_n) = p_{\{1\}}(x_1)\cdots p_{\{n\}}(x_n) \quad \text{for all } x \in \mathcal{X},$$

where . The independence model is the set of all product distributions on . The support sets of the independence model are the sets of the form with for each .

Let be a simplicial complex on . The hierarchical model consists of all probability distributions that have a factorization of the form , where is a positive function that depends only on the -coordinates of . The model is called reducible if there exist simplicial subcomplexes such that and is a simplex. In this case, the set is called a separator. Furthermore, is decomposable if it can be iteratively reduced into simplices. Such an iterative reduction can be described by a junction tree, which is a tree with vertex set the set of facets of and with edge labels the separators. The independence model is an example of a decomposable model. We give another example in Fig. 1 and refer to  for more details. In general, the junction tree is not unique, but the multi-set of separators is unique.

For most models there is no closed-form expression for , since there is no closed formula for . However, for some of the models mentioned above a closed formula does exist:

The divergence from the independence model is called multi-information and satisfies

$$MI(X_1,\dots,X_n) = D(p\|\mathcal{M}_1) = -H(X_1,\dots,X_n) + \sum_{k=1}^{n} H(X_k). \tag{2}$$

If it is also called the mutual information of and . The divergence from equals (see [4, eq. (1)])

$$D(p\|\mathcal{M}_{\varrho,\nu}) = D\Big(p \,\Big\|\, \sum_{k=1}^{K} p(A_k)\,\nu(x|A_k)\Big). \tag{3}$$

For a decomposable model with junction tree ,

$$D(p\|\mathcal{M}_S) = \sum_{S\in V} H_p(X_S) - \sum_{S\in E} H_p(X_S) - H(p). \tag{4}$$

Here, denotes the joint entropy of the random variables under .
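As an illustration of the closed formula (2), the multi-information of a joint distribution can be computed directly from its marginal entropies. This is a small sketch under our own conventions, not code from the paper: the joint distribution is represented as a dictionary from state tuples to probabilities, and the two example distributions are made up.

```python
import math

def entropy(probabilities):
    """Shannon entropy in nats of an iterable of probabilities."""
    return -sum(v * math.log(v) for v in probabilities if v > 0)

def multi_information(joint):
    """Divergence from the independence model, eq. (2).

    `joint` maps tuples (x_1, ..., x_n) to probabilities.
    """
    n_vars = len(next(iter(joint)))
    total = -entropy(joint.values())
    for k in range(n_vars):
        marginal = {}
        for x, prob in joint.items():
            marginal[x[k]] = marginal.get(x[k], 0.0) + prob
        total += entropy(marginal.values())
    return total

# Two perfectly correlated bits: the multi-information equals log(2).
correlated = {(0, 0): 0.5, (1, 1): 0.5}
# Two independent fair bits: the multi-information vanishes.
independent = {(a, b): 0.25 for a in (0, 1) for b in (0, 1)}

print(multi_information(correlated), multi_information(independent))
```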

### 2.2 Dirichlet prior

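The Dirichlet distribution (or Dirichlet prior) with concentration parameter $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_N)$, $\alpha_i > 0$, is the probability distribution on $\Delta_{N-1}$ with density $\mathrm{Dir}_{\boldsymbol{\alpha}}(p) = \frac{\Gamma(\sum_{i=1}^{N}\alpha_i)}{\prod_{i=1}^{N}\Gamma(\alpha_i)} \prod_{i=1}^{N} p_i^{\alpha_i - 1}$ for all $p \in \Delta_{N-1}$, where $\Gamma$ is the gamma function. We write $\alpha = \sum_{i=1}^{N} \alpha_i$.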

Note that is the uniform probability density on . Furthermore, note that is uniformly concentrated in the point measures (it assigns mass to , for all ), and is concentrated in the uniform distribution . In general, if , then is the Dirac delta concentrated on .

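A basic property of the Dirichlet distributions is the aggregation property: Consider a partition $\varrho = \{A_1,\dots,A_K\}$ of $\mathcal{X}$. If $p \sim \mathrm{Dir}(\alpha_1,\dots,\alpha_N)$, then $\big(\sum_{i\in A_1} p_i, \dots, \sum_{i\in A_K} p_i\big) \sim \mathrm{Dir}\big(\sum_{i\in A_1}\alpha_i, \dots, \sum_{i\in A_K}\alpha_i\big)$, see, e.g., . We write $\boldsymbol{\alpha}^{\varrho} = (\alpha^{\varrho}_1,\dots,\alpha^{\varrho}_K)$, $\alpha^{\varrho}_k = \sum_{i\in A_k}\alpha_i$, for the concentration parameter induced by the partition $\varrho$.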
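The aggregation property can be checked empirically. The sketch below is illustrative only: the concentration parameters, the partition, and the helper names are assumptions made for this example. It samples from a Dirichlet distribution by normalizing independent gamma variates, sums the coordinates over each block, and compares the first two moments of one block mass with those of the Dirichlet (here Beta) distribution that has the block-summed parameters.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from Dir(alpha) by normalizing gamma variates."""
    gammas = [rng.gammavariate(a_i, 1.0) for a_i in alpha]
    total = sum(gammas)
    return [g / total for g in gammas]

def aggregate(values, blocks):
    """Sum the entries of `values` over each block of the partition."""
    return [sum(values[i] for i in block) for block in blocks]

rng = random.Random(1)
alpha = [0.5, 1.0, 1.5, 2.0, 3.0]          # assumed concentration parameters
blocks = [[0, 1], [2, 3, 4]]               # assumed partition of the 5 states
alpha_blocks = aggregate(alpha, blocks)    # induced parameters [1.5, 6.5]
alpha_total = sum(alpha)

trials = 20000
first = [aggregate(sample_dirichlet(alpha, rng), blocks)[0] for _ in range(trials)]
mc_mean = sum(first) / trials
mc_var = sum((x - mc_mean) ** 2 for x in first) / trials

# Moments of the first coordinate of Dir(alpha_blocks).
exact_mean = alpha_blocks[0] / alpha_total
exact_var = (alpha_blocks[0] * (alpha_total - alpha_blocks[0])
             / (alpha_total ** 2 * (alpha_total + 1)))

print(mc_mean, exact_mean, mc_var, exact_var)
```

Matching moments do not prove equality of distributions, but they are a cheap consistency check on the induced parameters.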

The aggregation property is useful when treating marginals of composite systems. Given a composite system with , , , we write , for the concentration parameter of the Dirichlet distribution induced on the -marginal .

## 3 Expected entropies and divergences

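For any $k \in \mathbb{N}$ let $h(k) = 1 + \frac{1}{2} + \dots + \frac{1}{k}$ be the $k$th harmonic number. It is known that for large $k$,

$$h(k) = \log(k) + \gamma + O\Big(\frac{1}{k}\Big),$$

where $\gamma$ is the Euler-Mascheroni constant. Moreover, $h(k) - \log(k)$ is strictly positive and decreases monotonically. We also need the natural analytic extension of $h$ to the non-negative reals, given by $h(z) = \partial_z \log\Gamma(z+1) + \gamma$, where $\Gamma$ is the gamma function.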
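For numerical work the extended harmonic function can be evaluated through the digamma function $\psi$, using the standard identity $h(x) = \psi(x+1) + \gamma$, which agrees with the harmonic numbers at the integers. The following sketch is an assumption-laden helper rather than the paper's code: `digamma` is a simple stdlib-only approximation (upward recurrence followed by a truncated asymptotic series), and the names `digamma`, `h`, and `EULER_GAMMA` are ours.

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma(x: float) -> float:
    """Approximate digamma for x > 0: recurrence up to x >= 6, then series."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ log x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))
    return result

def h(x: float) -> float:
    """Analytic extension of the harmonic numbers to x >= 0."""
    return digamma(x + 1.0) + EULER_GAMMA

print(h(1), h(4), h(0.5))
print(h(1000) - math.log(1000) - EULER_GAMMA)   # small, of order 1/k
```

With SciPy available, `scipy.special.digamma` could replace the hand-written approximation.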

The following theorems contain formulas for the expectation value of the divergence from the models defined in the previous section, as well as asymptotic expressions of these formulas. The results are based on explicit solutions of the integral (1), as done by Wolpert and Wolf in . The proofs are contained in Appendix A.

###### Theorem 1.

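If $p \sim \mathrm{Dir}(\boldsymbol{\alpha})$, then:

• $\langle H(p)\rangle = h(\alpha) - \sum_{i=1}^{N} \frac{\alpha_i}{\alpha}\, h(\alpha_i)$,

• $\langle D(p\|u)\rangle = \log(N) - h(\alpha) + \sum_{i=1}^{N} \frac{\alpha_i}{\alpha}\, h(\alpha_i)$.

In the symmetric case $\boldsymbol{\alpha} = (a,\dots,a)$,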

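• $$\langle H(p)\rangle = h(Na) - h(a) = \begin{cases} \log(Na) - h(a) + \gamma + O(1/Na) & \text{for large } N \text{ and const. } a \\ \log(N) + O(1/a) & \text{for large } a \text{ and arb. } N \\ O(Na) & \text{as } a \to 0 \text{ with bounded } N \\ h(c) + O(a) & \text{as } a \to 0 \text{ with } Na = c \end{cases}$$

• $$\langle D(p\|u)\rangle = \log(N) - h(Na) + h(a) = \begin{cases} h(a) - \log(a) - \gamma + O(1/Na) & \text{for large } N \text{ and const. } a \\ O(1/a) & \text{for large } a \text{ and arb. } N \\ \log(N) + O(Na) & \text{as } a \to 0 \text{ with bounded } N \\ \log(N) - h(c) + O(a) & \text{as } a \to 0 \text{ with } Na = c. \end{cases}$$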

The entropy is maximized by the uniform distribution , which satisfies . For large , or , the average entropy is close to the maximum value. It follows that in these cases the expected divergence from the uniform distribution  remains bounded. The fact that the expected entropy is close to the maximal entropy makes it difficult to estimate the entropy. See  for a discussion.
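The symmetric-case formula $\langle D(p\|u)\rangle = \log(N) - h(Na) + h(a)$ can be compared with a Monte Carlo estimate. The sketch below is a sanity check under stated assumptions, not the authors' computation: $h$ is evaluated as $\psi(x+1)+\gamma$ with a hand-written digamma approximation, Dirichlet samples are produced by normalizing gamma variates, and the values of `N`, `a`, and `trials` are arbitrary choices.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def digamma(x):
    """Approximate digamma for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def h(x):
    return digamma(x + 1.0) + EULER_GAMMA

def expected_divergence_from_uniform(n_states, a):
    """Closed form for the symmetric Dirichlet prior Dir(a, ..., a)."""
    return math.log(n_states) - h(n_states * a) + h(a)

def sample_symmetric_dirichlet(n_states, a, rng):
    gammas = [rng.gammavariate(a, 1.0) for _ in range(n_states)]
    total = sum(gammas)
    return [g / total for g in gammas]

def divergence_from_uniform(p):
    n_states = len(p)
    return sum(pi * math.log(pi * n_states) for pi in p if pi > 0)

rng = random.Random(0)
N, a, trials = 20, 1.0, 5000
mc = sum(divergence_from_uniform(sample_symmetric_dirichlet(N, a, rng))
         for _ in range(trials)) / trials
exact = expected_divergence_from_uniform(N, a)

print(mc, exact)
# For the uniform prior (a = 1) the closed form approaches 1 - gamma.
print(expected_divergence_from_uniform(1000, 1.0), 1 - EULER_GAMMA)
```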

###### Theorem 2.a.

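For any $q \in \Delta_{N-1}$, if $p \sim \mathrm{Dir}(\boldsymbol{\alpha})$, then

$$\langle D(p\|q)\rangle_p = \sum_{i=1}^{N} \frac{\alpha_i}{\alpha}\big(h(\alpha_i) - \log(q_i)\big) - h(\alpha) = D\Big(\frac{\boldsymbol{\alpha}}{\alpha} \,\Big\|\, q\Big) + O(N/\alpha).$$

If $p \sim \mathrm{Dir}(a,\dots,a)$, then

$$\langle D(p\|q)\rangle_p = D(u\|q) + h(a) + \log(N) - h(Na) = D(u\|q) + \big(h(a) - \log(a)\big) - \gamma + O\big(1/(Na)\big).$$
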
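The general and the symmetric formula of Theorem 2.a must agree when all concentration parameters are equal. The following sketch verifies this numerically. It is illustrative only: the target distribution `q`, the value of `a`, and the function names are assumptions, and $h$ is again computed as $\psi(x+1)+\gamma$ with an approximate digamma.

```python
import math

EULER_GAMMA = 0.5772156649015329

def digamma(x):
    """Approximate digamma for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def h(x):
    return digamma(x + 1.0) + EULER_GAMMA

def expected_div_general(alpha, q):
    """Sum_i alpha_i/alpha * (h(alpha_i) - log q_i) - h(alpha)."""
    total = sum(alpha)
    return sum(a_i / total * (h(a_i) - math.log(q_i))
               for a_i, q_i in zip(alpha, q)) - h(total)

def expected_div_symmetric(a, q):
    """D(u||q) + h(a) + log N - h(N a) for the prior Dir(a, ..., a)."""
    n = len(q)
    d_u_q = sum((1.0 / n) * math.log((1.0 / n) / q_i) for q_i in q)
    return d_u_q + h(a) + math.log(n) - h(n * a)

q = [0.1, 0.2, 0.3, 0.4]   # assumed target distribution
a = 2.5                     # assumed symmetric concentration parameter
general = expected_div_general([a] * len(q), q)
symmetric = expected_div_symmetric(a, q)

print(general, symmetric)
```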
###### Theorem 2.b.

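For any $p \in \Delta_{N-1}$, if $q \sim \mathrm{Dir}(\boldsymbol{\alpha})$, then

$$\langle D(p\|q)\rangle_q = \sum_{i=1}^{N} p_i\big(\log(p_i) - h(\alpha_i - 1)\big) + h(\alpha - 1).$$

If $\alpha_i > 1$ for all $i$, then

$$\langle D(p\|q)\rangle_q = D\Big(p \,\Big\|\, \frac{\boldsymbol{\alpha}}{\alpha}\Big) + \sum_{i=1}^{N} O\big(1/(\alpha_i - 1)\big).$$
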
###### Theorem 2.c.

When and , then

• ,

• .

If , then .

Consider a sequence of distributions , . As , the expected divergence with respect to the uniform prior is bounded from above by , if and only if . It is easy to see that whenever satisfies for all . Therefore, the expected divergence is unbounded as tends to infinity only if the sequence accumulates at the boundary of the probability simplex. In fact, whenever is in the subsimplex . The relative Lebesgue volume of this subsimplex in is .

For arbitrary Dirichlet priors (depending on ), the expectation value remains bounded in the limit if remains bounded and if is bounded from below by a positive constant for all .

If , then the expected divergence remains bounded in the limit , provided is bounded from below by a positive constant.

###### Theorem 3.

For a system of random variables with joint probability distribution , if , then

• ,

• .

In the symmetric case ,

• ,

• .

If, moreover, is large for all (this happens, for example, when remains bounded from below by some and (i) all become large, or (ii) all are bounded and becomes large), then

• ,

• .

If is large for all , then the expected entropy of a subsystem is also close to its maximum, and hence the expected multi-information is bounded. This follows also from the fact that the independence model contains the uniform distribution, and hence .

###### Theorem 4.

Let be a partition of into sets of cardinalities , and let be a reference measure on . If , then

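$$\langle D(p\|\mathcal{M}_{\varrho,\nu})\rangle = \sum_{i=1}^{N} \frac{\alpha_i}{\alpha}\big(h(\alpha_i) - \log(\nu_i)\big) - \sum_{k=1}^{K} \frac{\alpha^{\varrho}_k}{\alpha}\big(h(\alpha^{\varrho}_k) - \log(\nu(A_k))\big),$$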

where . If , and (wlog) ,

$$\langle D(p\|\mathcal{M}_{\varrho,\nu})\rangle = h(a) - \sum_{k=1}^{K} \frac{L_k}{N}\big(h(L_k a) - \log(L_k)\big) + D(u\|\nu).$$

If furthermore , then

$$\langle D(p\|\mathcal{M}_{\varrho,\nu})\rangle = h(a) - \log(a) - \gamma + D(u\|\nu) + O(1/N).$$
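For a partition model (uniform reference measure, so that $D(u\|\nu) = 0$) the symmetric formula of Theorem 4 reduces to $h(a) - \sum_k \frac{L_k}{N}(h(L_k a) - \log L_k)$. The sketch below compares this with a Monte Carlo estimate based on eq. (3). It is a hedged illustration: the partition, the value of `a`, the sample size, and all helper names are assumptions, and $h$ is computed as $\psi(x+1)+\gamma$ with an approximate digamma.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329

def digamma(x):
    """Approximate digamma for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))

def h(x):
    return digamma(x + 1.0) + EULER_GAMMA

def expected_partition_divergence(block_sizes, a):
    """Closed form for a partition model under the prior Dir(a, ..., a)."""
    n = sum(block_sizes)
    return h(a) - sum(L / n * (h(L * a) - math.log(L)) for L in block_sizes)

def divergence_from_partition_model(p, blocks):
    """Eq. (3) with uniform nu: compare p with its block-wise flattening."""
    total = 0.0
    for block in blocks:
        mass = sum(p[i] for i in block)
        for i in block:
            if p[i] > 0:
                total += p[i] * math.log(p[i] * len(block) / mass)
    return total

def sample_symmetric_dirichlet(n_states, a, rng):
    gammas = [rng.gammavariate(a, 1.0) for _ in range(n_states)]
    total = sum(gammas)
    return [g / total for g in gammas]

rng = random.Random(2)
blocks = [[0, 1, 2, 3], [4, 5, 6, 7, 8, 9]]   # assumed partition of 10 states
a, trials = 1.0, 5000
mc = sum(divergence_from_partition_model(sample_symmetric_dirichlet(10, a, rng), blocks)
         for _ in range(trials)) / trials
exact = expected_partition_divergence([len(b) for b in blocks], a)

print(mc, exact)
```

Two limiting cases are easy to read off: with a single block the model is the uniform distribution and the formula reduces to Theorem 1, and with all blocks of size one the model is the whole simplex and the expected divergence vanishes.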

If , then is a partition model and contains the uniform distribution. Therefore, the expected divergence is again bounded. In contrast, the maximal divergence is . The result for mixtures of product distributions of disjoint supports is similar:

###### Theorem 5.

Let be the joint state space of variables, , . Let be a partition of into support sets , of the independence model, and let be the model containing all mixtures of product distributions with .

• If , then

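$$\langle D(p\|\mathcal{M}^{\varrho}_1)\rangle = \sum_{i=1}^{N} \frac{\alpha_i}{\alpha}\big(h(\alpha_i) - h(\alpha)\big) + \sum_{k=1}^{K} (|G_k| - 1)\frac{\alpha^{\varrho}_k}{\alpha}\big(h(\alpha^{\varrho}_k) - h(\alpha)\big) - \sum_{k=1}^{K} \sum_{j\in G_k} \sum_{x_j \in \mathcal{X}_{j,k}} \frac{\alpha^{k,x_j}}{\alpha}\big(h(\alpha^{k,x_j}) - h(\alpha)\big),$$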

where , , and is the set of variables that take more than one value in the block .

• Assume that the system is homogeneous, for all , and that is a cylinder set of cardinality , , for all . If , then

$$\langle D(p\|\mathcal{M}^{\varrho}_1)\rangle = h(a) + \sum_{k=1}^{K} N_1^{m_k - n}\Big((m_k - 1)\, h(N_1^{m_k} a) - m_k\, h(N_1^{m_k - 1} a)\Big).$$

• If is large for all , then

$$\langle D(p\|\mathcal{M}^{\varrho}_1)\rangle = h(a) - \log(a) - \gamma + O\Big(\max_k \frac{m_k}{N_1^{m_k - 1} a}\Big).$$

The -mixture of binary product distributions with disjoint supports is a submodel of the restricted Boltzmann machine model with hidden units, see . Hence Theorem 5 also gives bounds for the expected divergence from restricted Boltzmann machines.

###### Theorem 6.

Consider a decomposable model with junction tree . If , then

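$$\langle D(p\|\mathcal{M}_S)\rangle = -\sum_{S\in V} \sum_{j\in\mathcal{X}_S} \frac{\alpha^{S}_j}{\alpha} h(\alpha^{S}_j) + \sum_{S\in E} \sum_{j\in\mathcal{X}_S} \frac{\alpha^{S}_j}{\alpha} h(\alpha^{S}_j) + (|V| - |E| - 1)\, h(\alpha) + \sum_{i=1}^{N} \frac{\alpha_i}{\alpha} h(\alpha_i),$$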

where for . If is drawn uniformly at random, then

$$\langle D(p\|\mathcal{M}_S)\rangle = \sum_{S\in V} \big(h(N) - h(N/N_S)\big) - \sum_{S\in E} \big(h(N) - h(N/N_S)\big) - h(N) + 1.$$

If is large for all , then

$$\langle D(p\|\mathcal{M}_S)\rangle = 1 - \gamma + O\big(\max_S N_S/N\big).$$
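For the uniform prior the formula of Theorem 6 involves only harmonic numbers at integer arguments, so it can be evaluated exactly with rational arithmetic. The sketch below is illustrative and rests on two assumptions of ours: $N_S$ is read as the number of joint states of the variables in $S$ (so that $N_S$ divides $N$), and an empty separator is given $N_S = 1$. The function names and the two example models are hypothetical.

```python
from fractions import Fraction

def harmonic(k: int) -> Fraction:
    """Exact k-th harmonic number h(k), with h(0) = 0."""
    return sum((Fraction(1, j) for j in range(1, k + 1)), Fraction(0))

def expected_divergence_uniform_prior(n_total, facet_sizes, separator_sizes):
    """Theorem 6 for the uniform prior.

    n_total         -- N, the number of joint states
    facet_sizes     -- N_S for each vertex S of the junction tree
    separator_sizes -- N_S for each edge label (separator) S
    """
    def term(n_s):
        return harmonic(n_total) - harmonic(n_total // n_s)
    return (sum(term(s) for s in facet_sizes)
            - sum(term(s) for s in separator_sizes)
            - harmonic(n_total) + 1)

# Chain X1 - X2 - X3 of binary variables: facets {1,2}, {2,3}, separator {2}.
chain = expected_divergence_uniform_prior(8, [4, 4], [2])
# Independence model of two binary variables: facets {1}, {2}, empty separator.
two_bits = expected_divergence_uniform_prior(4, [2, 2], [1])

print(chain, two_bits)
```

Both small examples evaluate to $1/12$, far below the limit $1-\gamma$, which is only approached when $N/N_S$ is large for all $S$.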

## 4 Discussion

In the previous section we have shown that the values of are very similar for different models in the limit of large , provided the Dirichlet concentration parameters remain bounded and the model remains small. In particular, if for all , then for large holds for , for independence models, for decomposable models, for partition models, and for mixtures of product distributions on disjoint supports (for reasonable values of the hyperparameters and ). Some of these models are contained in each other, but nevertheless, the expected divergences do not differ much. The general phenomenon seems to be the following:

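• If $N$ is large and if $\mathcal{M} \subset \Delta_{N-1}$ is low-dimensional, then the expected divergence is $\langle D(p\|\mathcal{M})\rangle \approx 1 - \gamma$, when $p$ is uniformly distributed on $\Delta_{N-1}$.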

Of course, this is not a mathematical statement, because it is easy to construct counter-examples: Space-filling curves can be used to construct one-dimensional models with an arbitrarily low value of (for arbitrary ). However, we expect that the statement is true for most models that appear in practice. In particular, we conjecture that the statement is true for restricted Boltzmann machines.

In Theorem 4, if , then the expected divergence from a convex exponential family is minimal, if and only if . In this case is a partition model. We conjecture that partition models are optimal among all (closures of) exponential families in the following sense:

• For any exponential family there is a partition model of the same dimension such that , when .

The statement is of course true for zero-dimensional exponential families, which consist of a single distribution. The conjecture is related to the following conjecture from :

• For any exponential family there is a partition model of the same dimension such that .

### Computations

Our findings may be biased by the fact that all models treated in Section 3 are exponential families. As a slight generalization we did computer experiments with a family of models which are not exponential families, but unions of exponential families.

Let be a family of partitions of , and let

be the union of the corresponding partition models. We are interested in these models, because they can be used to study more difficult models, like restricted Boltzmann machines and deep belief networks. Figure 2 compares a single partition model on three states with the union of all partition models.

Figure 2: From left to right: Divergence to a partition model with two blocks on $\mathcal{X}=\{1,2,3\}$. Same, multiplied by a symmetric Dirichlet density with parameter $a=5$. Divergence to the union of the three partition models with two blocks on $\mathcal{X}=\{1,2,3\}$. Same, multiplied by the symmetric Dirichlet density with $a=5$. The shading is scaled on each image individually.

For a given and let be the set of all partitions of into two blocks of cardinalities and . For different values of and we computed for distributions sampled from , for distributions sampled from , for distributions sampled from (for only samples; in this case there are as many as homogeneous bipartitions). The results are shown in Figure 3.

Figure 3: Expected divergence (numerically) from $\mathcal{M}_{\Upsilon_k}$ with respect to $\mathrm{Dir}(a,\dots,a)$, for different system sizes $N$ and values of $a$. Left: The case $k=1$. The y-ticks are located at $h(a)-\log(a)-\gamma$, which are the limits of the expected divergence from single bipartition models, see Theorem 4. Middle: The case $k=2$. The peak at $N=4$ emerges, because in this case there are only 3 different partitions, instead of $\binom{4}{2}$. The dashed plot indicates corresponding results from the left figure. Right: The expected divergence to the union of all $\binom{N}{N/2}/2$ bipartition models with two blocks of cardinalities $N/2$, for even $N$.

In the first two cases the expected divergence seems to tend to the asymptotic value of . Observe that , unless . Intuitively this makes sense for two reasons: First, for and , using Theorem 4 one can show that ; and second, the cardinality of is much larger than the cardinality of if . For small values of this intuition may not always be correct. For example, for , the expected divergence from is larger than the one from , although in this case and , see Figure 3 right.

We expect that, for large , it is possible to make much smaller than by choosing . In this case, the model has (Hausdorff) dimension only one, but it is a union of exponentially many one-dimensional exponential families.

## Appendix A Proofs

The analytic formulas in Theorem 1 are [10, Theorem 7]. The asymptotic expansions are direct.

The proof of Theorem 2.a makes use of the following Lemma, which is a consequence of [10, Theorem 5] and the aggregation property of the Dirichlet distribution:

###### Lemma 7.

Let be a partition of , let be positive real numbers, and let for . Then

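$$\int_{\Delta_{N-1}} \Big(\sum_{i\in A_k} p_i\Big) \log\Big(\sum_{i\in A_k} p_i\Big) \prod_{i=1}^{N} p_i^{\alpha_i - 1}\,\mathrm{d}p = \int_{\Delta_{K-1}} p^{*}_k \log(p^{*}_k) \prod_{k'=1}^{K} (p^{*}_{k'})^{\alpha_{k'} - 1}\,\mathrm{d}p^{*} = \frac{\alpha_k \prod_{k'=1}^{K} \Gamma(\alpha_{k'})}{\Gamma(\alpha + 1)}\big(h(\alpha_k) - h(\alpha)\big).$$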

Let . Theorem 2.a follows from [10, Theorem 3]:

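$$\int_{\Delta_{N-1}} p_i \prod_{j=1}^{N} p_j^{n_j}\,\mathrm{d}p \Big/ \int_{\Delta_{N-1}} \prod_{j=1}^{N} p_j^{n_j}\,\mathrm{d}p = \frac{(n_i + 1)\prod_{j=1}^{N}\Gamma(n_j + 1)}{\Gamma(N + n + 1)} \Big/ \frac{\prod_{j=1}^{N}\Gamma(n_j + 1)}{\Gamma(N + n)} = \frac{n_i + 1}{N + n},$$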

and . By Lemma 7,

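$$\int_{\Delta_{N-1}} \log(p_i) \prod_{j=1}^{N} p_j^{n_j}\,\mathrm{d}p \Big/ \int_{\Delta_{N-1}} \prod_{j=1}^{N} p_j^{n_j}\,\mathrm{d}p = h(n_i) - h(N + n - 1),$$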

and this implies Theorems 2.b and 2.c.

Theorem 3 is a corollary to Theorem 1, the aggregation property of the Dirichlet priors and the formula (2) for the multi-information. Theorem 4 follows from eq. (3), and Theorem 6 follows from eq. (4). Similarly, Theorem 5 follows from the equality

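$$D(p\|\mathcal{M}^{\varrho}_1) = \sum_{i=1}^{K} \sum_{x\in A_i} p(x) \log\frac{p(x)\, p(A_i)^{n-1}}{\prod_{j=1}^{n}\big(\sum_{y\in A_i :\, y_j = x_j} p(y)\big)},$$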

which can be derived as follows: The unique solution satisfies and .