Towards an information-theory for hierarchical partitions

by   Juan I. Perotti, et al.
Universidad Nacional de Cordoba

Complex systems often require descriptions covering a wide range of scales and organization levels, where a hierarchical decomposition of their description into components and sub-components is often convenient. To better understand the hierarchical decomposition of complex systems, in this work we prove a few essential results that contribute to the development of an information-theory for hierarchical-partitions.




I Introduction

The decomposition of a system into its components and sub-components is the essence of reductionism. But the reductionist approach is not easily applicable to complex systems where different emergent patterns at several scales and organization levels are often observed Simon (1962); Barabási (2007). Nevertheless, the convenience of having a hierarchical-partitioning of the system representation is manifested in the various frameworks devised to hierarchically decompose the structure Robinson and Foulds (1981); Ravasz and Barabási (2003); Rosvall and Bergstrom (2011); Queyroi et al. (2013); Peixoto (2014); Tibély et al. (2016); Grauwin et al. (2017) and the behavior Staroswiecki et al. (1983); Crutchfield (1994); Rosvall et al. (2014) of complex systems. In physics, these frameworks include the renormalization group theory of critical phenomena Zinn-Justin (2007) and cluster expansion methods Tanaka (2002). These methods are traditionally used with homogeneous hierarchies, but they have also been applied to heterogeneous cases where finding an appropriate decomposition is difficult Guimerà et al. (2003); Song et al. (2005); Zhou et al. (2006); García-Pérez et al. (2018) and where suitable methods for the comparison of hierarchical-partitions are helpful Perotti et al. (2015); Kheirkhahzadeh et al. (2016); Gates et al. (2019).

The development of methods for comparing hierarchies is generally non-trivial. Different proposals already exist, including tree-edit distance methods Waterman and Smith (1978); Lu (1979); Gordon (1987); Bille (2005); Zhang et al. (2009); Queyroi and Kirgizov (2015), ad-hoc methods Sokal and Rohlf (1962); Fowlkes and Mallows (1983); Gates et al. (2019), and even an information-theoretic method Tibély et al. (2014). The ad-hoc methods could be useful for applications but often lack a well-studied theoretical background. Methods based on tree-edit distances are typically algorithmic in nature, hence they frequently rely on sub-optimal approximations or only work with fully labeled trees. Similarly, the existing information-theoretic method cannot work with hierarchical-partitions either.

To meet the need for a comparison method for hierarchies that rests on a well-defined theoretical background, we have recently introduced the Hierarchical Mutual Information (HMI) Perotti et al. (2015). The HMI can be used to compare hierarchical-partitions in the same way that the standard Mutual Information Cover and Thomas (2006) is used to compare flat partitions Danon et al. (2005). The HMI has proven to be a useful tool for the comparison of detected hierarchical community structures and related problems Yang et al. (2017); Kheirkhahzadeh et al. (2016). Still, many of its theoretical properties remain to be understood.

In this work, we study the theoretical aspects of the HMI and how to exploit them to develop an information-theory for hierarchical-partitions. Specifically, in Sec. II we introduce some preliminary definitions and revisit the HMI. In Sec. III we present the main results. Firstly, we prove some fundamental properties of the HMI. Secondly, we use the HMI to introduce other information-theoretic quantities for hierarchical-partitions, emphasizing the study of the metric properties of the hierarchical generalization of the Variation of Information (VI) and the statistical properties of the hierarchical extension of the Adjusted Mutual Information (AMI). In Sec. IV we discuss some important consequences deriving from the presented results. Finally, in Sec. V we provide a summary of the contributions and corresponding opportunities for future work.

II Theory

II.1 Preliminary definitions

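Let $T$ denote a directed rooted tree. We say that $t \in T$ when $t$ is a node of $T$. Let $T_t \subset T$ be the set of children of node $t \in T$. If $T_t = \emptyset$ then $t$ is a leaf of $T$. Otherwise, it is an internal node of $T$. Let $\ell_t$ denote the depth or topological distance between $t$ and the root of $T$. In particular, $\ell_t = 0$ if $t$ is the root. Let $T_\ell$ be the set of all nodes of $T$ at depth $\ell$. Clearly, $T = \bigcup_\ell T_\ell$. Let $T^t$ be the sub-tree obtained from $t$ and its descendants in $T$.

A hierarchical-partition $\mathcal{T} := \{U_t : t \in T\}$ of the universe $U := \{1, \dots, n\}$, the set of the first $n$ natural numbers, is defined in terms of a rooted tree $T$ and corresponding subsets $U_t \subseteq U$ satisfying

  • $\bigcup_{t' \in T_t} U_{t'} = U_t$ for all non-leaf $t$, and

  • $U_{t'} \cap U_{t''} = \emptyset$ for every pair of different $t', t'' \in T_t$.

For every non-leaf $t$, the set $\mathcal{T}_t := \{U_{t'} : t' \in T_t\}$ represents a partition of $U_t$, and $\mathcal{T}_\ell := \{U_t : t \in T_\ell\}$ is the ordinary partition of $U$ determined by $\mathcal{T}$ at depth $\ell$. Furthermore, $\mathcal{T}^t := \{U_{t'} : t' \in T^t\}$ is the hierarchical-partition of the universe $U_t$ determined by the tree $T^t$ of root $t$. See Fig. 1 for a schematic representation of a hierarchical-partition of the universe $U$.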
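To make the definition concrete, the following is a minimal Python sketch of one possible encoding, not the representation used in the authors' code. It assumes that a hierarchical-partition is stored as nested lists, where a leaf is a list of elements and an internal node is a list of child nodes; the function names are hypothetical. With this encoding the union condition holds by construction, so validity reduces to every element appearing exactly once.

```python
def is_leaf(node):
    """A leaf is a list holding elements only (no nested lists)."""
    return not any(isinstance(item, list) for item in node)


def elements(node):
    """Return the elements of the subset attached to a node, in traversal order."""
    if is_leaf(node):
        return list(node)
    return [e for child in node for e in elements(child)]


def is_hierarchical_partition(tree, n):
    """Check that every element of {1, ..., n} appears exactly once in the tree."""
    return sorted(elements(tree)) == list(range(1, n + 1))


def level_partition(node, depth):
    """Flat partition at the given depth (assumption: shallower leaves are carried down)."""
    if depth == 0 or is_leaf(node):
        return [sorted(elements(node))]
    return [block for child in node for block in level_partition(child, depth - 1)]


# Hypothetical example: leaves {1, 2}, {3} and {4, 5}, located at different depths.
tree = [[1, 2], [[3], [4, 5]]]
print(level_partition(tree, 1))  # [[1, 2], [3, 4, 5]]
print(level_partition(tree, 2))  # [[1, 2], [3], [4, 5]]
```

The sketch assumes internal nodes contain only lists; mixing bare elements and child lists in the same node is not supported.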

Figure 1: Schematic representation of a hierarchical-partition of the universe with root , 5 internal nodes including and and 6 leaves including and . Some leaves may contain more than one element, e.g. . Different leaves may exist at different depths . For instance, leaf is at depth while leaf is at depth . The sub-tree contains the nodes and . The set contains the children and of .

II.2 The Hierarchical Mutual Information

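The Hierarchical Mutual Information (HMI) Perotti et al. (2015) between two hierarchical-partitions $\mathcal{T}$ and $\mathcal{T}'$ of the same universe $U$ reads

$$I(\mathcal{T};\mathcal{T}') := I(t_0;t'_0), \tag{1}$$

where $t_0$ and $t'_0$ are the roots of trees $T$ and $T'$, respectively. Here,

$$I(t;t') := I(\mathcal{T}_t;\mathcal{T}'_{t'}|tt') + \sum_{u \in T_t} \sum_{u' \in T'_{t'}} P(uu'|tt')\, I(u;u') \tag{2}$$

is a recursively defined expression for every pair of nodes $t \in T$ and $t' \in T'$ with the same depth $\ell_t = \ell_{t'}$, where the sums run over the children of $t$ and $t'$. The probabilities in $P(uu'|tt') := P(uu')/P(tt')$ are ultimately defined from $P(tt') := |U_t \cap U'_{t'}|/n$, with $U_t$ the subset of the universe attached to node $t$, and the convention $P(uu'|tt') := 0$ whenever $P(tt') = 0$. The quantity

$$I(\mathcal{T}_t;\mathcal{T}'_{t'}|tt') := H(\mathcal{T}_t|tt') + H(\mathcal{T}'_{t'}|tt') - H(\mathcal{T}_t\mathcal{T}'_{t'}|tt') \tag{3}$$

represents a mutual information between the standard partitions $\mathcal{T}_t$ and $\mathcal{T}'_{t'}$ restricted to the subset $U_t \cap U'_{t'}$ of the universe $U$, and is defined in terms of the three entropies

$$H(\mathcal{T}_t|tt') := -\sum_{u \in T_t} P(u|tt') \ln P(u|tt'), \tag{4}$$

$$H(\mathcal{T}'_{t'}|tt') := -\sum_{u' \in T'_{t'}} P(u'|tt') \ln P(u'|tt'), \tag{5}$$

$$H(\mathcal{T}_t\mathcal{T}'_{t'}|tt') := -\sum_{u \in T_t} \sum_{u' \in T'_{t'}} P(uu'|tt') \ln P(uu'|tt'), \tag{6}$$

where $P(u|tt') := \sum_{u'} P(uu'|tt')$ and $P(u'|tt') := \sum_{u} P(uu'|tt')$ are the marginals, and the convention $0 \ln 0 = 0$ is adopted. For details on how to compute these quantities, please check our code Perotti (2020).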
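The recursion can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation (see Perotti (2020) for that): hierarchical-partitions are assumed to be nested lists (a leaf is a list of elements, an internal node is a list of child nodes), natural logarithms are used, and the recursion is assumed to stop as soon as either node is a leaf. The names `hmi`, `is_leaf` and `elements` are hypothetical.

```python
from math import log


def is_leaf(node):
    """A leaf is a list holding elements only (no nested lists)."""
    return not any(isinstance(item, list) for item in node)


def elements(node):
    """Set of universe elements attached to a node."""
    if is_leaf(node):
        return set(node)
    return set().union(*(elements(child) for child in node))


def hmi(t, t_prime):
    """Recursive term I(t; t') in nats; call it on the two roots to get the HMI."""
    if is_leaf(t) or is_leaf(t_prime):
        return 0.0  # assumption: nothing left to compare below a leaf
    common = elements(t) & elements(t_prime)  # restriction to the shared subset
    if not common:
        return 0.0
    total = len(common)
    blocks = [(child, elements(child) & common) for child in t]
    blocks_prime = [(child, elements(child) & common) for child in t_prime]

    def entropy(sizes):
        # Empty blocks are skipped, which implements the convention 0 ln 0 = 0.
        return -sum(s / total * log(s / total) for s in sizes if s > 0)

    h_t = entropy(len(block) for _, block in blocks)
    h_t_prime = entropy(len(block) for _, block in blocks_prime)
    joint_sizes = []
    recursion = 0.0
    for child, block in blocks:
        for child_prime, block_prime in blocks_prime:
            overlap = len(block & block_prime)
            if overlap:
                joint_sizes.append(overlap)
                recursion += overlap / total * hmi(child, child_prime)
    return h_t + h_t_prime - entropy(joint_sizes) + recursion


a = [[[1], [2]], [[3], [4]]]
b = [[[1], [3]], [[2], [4]]]
print(hmi(a, a), hmi(a, b))
```

For the two hypothetical trees above, `hmi(a, a)` equals $\ln 4$ up to rounding, while `hmi(a, b)` vanishes up to rounding because the two top-level splits are independent and every pair of sub-trees shares a single element. The sketch recomputes element sets at every call, which is simple but not efficient.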

III Results

For simplicity, we consider hierarchical-partitions $\mathcal{T}$ and $\mathcal{T}'$ with all leaves at the same depth $L$. The results can be easily generalized to trees with leaves at different depths at the expense of more complicated notation.

III.1 Properties of the HMI

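It is convenient to begin by rewriting the hierarchical mutual information in the following alternative form, which is better suited to our purposes (see App. A for a detailed derivation)

$$\begin{aligned}
I(\mathcal{T};\mathcal{T}') &= \sum_{\ell=0}^{L-1} \sum_{t \in T_\ell} \sum_{t' \in T'_\ell} P(tt')\, I(\mathcal{T}_t;\mathcal{T}'_{t'}|tt') \\
&= \sum_{\ell=0}^{L-1} I(\mathcal{T}_{\ell+1};\mathcal{T}'_{\ell+1}|\mathcal{T}_\ell\mathcal{T}'_\ell) \\
&= \sum_{\ell=0}^{L-1} \left[ H(\mathcal{T}_{\ell+1}|\mathcal{T}_\ell\mathcal{T}'_\ell) + H(\mathcal{T}'_{\ell+1}|\mathcal{T}_\ell\mathcal{T}'_\ell) - H(\mathcal{T}_{\ell+1}\mathcal{T}'_{\ell+1}|\mathcal{T}_\ell\mathcal{T}'_\ell) \right],
\end{aligned} \tag{7}$$

where $L$ is the common depth of the leaves and, in the last two lines, we use flat (i.e. standard, non-hierarchical) conditional MIs and entropies of the stochastic variables $\mathcal{T}_\ell$ and $\mathcal{T}'_\ell$ for $\ell = 0, \dots, L$. We have thus rewritten the HMI as a level-by-level sum of conditional MIs.

Starting from Eq. 7, we prove the following property of the HMI (see App. B for a detailed derivation)

$$I(\mathcal{T};\mathcal{T}') \le I(\mathcal{T};\mathcal{T}). \tag{8}$$

In words, this result states that the HMI between two arbitrary hierarchical-partitions $\mathcal{T}$ and $\mathcal{T}'$ of the same universe is smaller than or equal to the mutual information between $\mathcal{T}$ and itself (or, analogously, between $\mathcal{T}'$ and itself), mirroring the property that holds for the flat mutual information Cover and Thomas (2006).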

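Now we exploit the result of Eq. 8 to show that the HMI can be properly normalized. Namely, if $M(x,y)$ is any generalized mean Bullen (2003), like the arithmetic-mean $(x+y)/2$, the geometric-mean $\sqrt{xy}$, the max-mean $\max(x,y)$ or the min-mean $\min(x,y)$, then the Normalized HMI (NHMI)

$$i(\mathcal{T};\mathcal{T}') := \frac{I(\mathcal{T};\mathcal{T}')}{M\big(I(\mathcal{T};\mathcal{T}),\, I(\mathcal{T}';\mathcal{T}')\big)} \tag{9}$$

satisfies $0 \le i(\mathcal{T};\mathcal{T}') \le 1$. The first inequality follows because $I(\mathcal{T};\mathcal{T}') \ge 0$ for any $\mathcal{T}$ and $\mathcal{T}'$. The second follows from Eq. 8, since every generalized mean is bounded from below by the minimum of its arguments.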
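The normalization can be illustrated with a short Python sketch. It takes the three HMI values as plain numbers, so it does not depend on how the HMI itself is computed; the function name, the dictionary of means and the zero-denominator convention are assumptions made for illustration only.

```python
from math import sqrt

# The four generalized means named in the text.
GENERALIZED_MEANS = {
    "arithmetic": lambda x, y: (x + y) / 2,
    "geometric": lambda x, y: sqrt(x * y),
    "max": max,
    "min": min,
}


def normalized_hmi(i_xy, i_xx, i_yy, mean="arithmetic"):
    """HMI between two hierarchical-partitions divided by a mean of their self-HMIs.

    i_xy: HMI between the two hierarchical-partitions.
    i_xx, i_yy: HMI of each hierarchical-partition with itself.
    """
    denominator = GENERALIZED_MEANS[mean](i_xx, i_yy)
    if denominator == 0:
        return 0.0  # assumption: convention for trivial hierarchical-partitions
    return i_xy / denominator


for name in GENERALIZED_MEANS:
    print(name, normalized_hmi(0.5, 1.0, 2.0, mean=name))
```

Because the numerator never exceeds the smaller of the two self-HMIs, all four choices of mean give a value in the unit interval; the min-mean gives the largest normalized value and the max-mean the smallest.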

III.2 Deriving other information-theoretic quantities

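Given the HMI, hierarchical versions of other information-theoretic quantities can be obtained by following the rules of the standard flat case. For example, the hierarchical entropy of a hierarchical-partition $\mathcal{T}$ can be defined as

$$H(\mathcal{T}) := I(\mathcal{T};\mathcal{T}) = \sum_{\ell=0}^{L-1} H(\mathcal{T}_{\ell+1}|\mathcal{T}_\ell) = H(\mathcal{T}_L), \tag{10}$$

where we have used that $H(\mathcal{T}_{\ell+1}|\mathcal{T}_\ell) = H(\mathcal{T}_{\ell+1}) - H(\mathcal{T}_\ell)$, because $\mathcal{T}_{\ell+1}$ refines $\mathcal{T}_\ell$, and that $H(\mathcal{T}_0) = 0$ (see Eq. D). Similarly, we can write down the hierarchical version of the joint entropy as

$$H(\mathcal{T}\mathcal{T}') := H(\mathcal{T}) + H(\mathcal{T}') - I(\mathcal{T};\mathcal{T}') \tag{11}$$

and the conditional entropy as

$$H(\mathcal{T}|\mathcal{T}') := H(\mathcal{T}\mathcal{T}') - H(\mathcal{T}') = H(\mathcal{T}) - I(\mathcal{T};\mathcal{T}'). \tag{12}$$

Furthermore, we can define a hierarchical version of the Variation of Information (HVI) as

$$V(\mathcal{T};\mathcal{T}') := H(\mathcal{T}|\mathcal{T}') + H(\mathcal{T}'|\mathcal{T}) = H(\mathcal{T}) + H(\mathcal{T}') - 2\, I(\mathcal{T};\mathcal{T}'). \tag{13}$$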
Because of Eq. 8, the properties , and follow, generalizing corresponding properties of the flat case. Unfortunately, we have found counter-examples violating the triangle inequality for the HVI, failing to generalize its flat counterpart in this particular sense Vinh et al. (2009). For instance, for the hierarchical-partitions , and , we find , which is a negative quantity. It is important to remark, however, that the violation of the triangular inequality is relatively weak. For instance, for the maximum difference is is found to be for , and , which is significantly larger than . In fact, as shown in Fig. 2 where the complementary cumulative distribution of differences


is plotted for all , and without repeating the symmetric cases and , and for different sizes , the overall contribution of the negative values is small, not only in magnitude but also in probability. Results for larger values of are not included since the number of triples grows quickly with , turning impractical their exhaustive computation. See App. C for how to generate all possible hierarchical-partitions for a given .
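The derived quantities of this subsection are simple arithmetic on HMI values. The following Python sketch assumes the flat-case identities carry over as described in the text: the hierarchical entropy is the HMI of a hierarchical-partition with itself, and the HVI is the sum of the two conditional entropies. It also assumes the plotted difference is the usual triangle-inequality gap. All function and argument names are hypothetical.

```python
def joint_entropy(i_xx, i_yy, i_xy):
    """Hierarchical joint entropy from the two self-HMIs and the cross HMI."""
    return i_xx + i_yy - i_xy


def conditional_entropy(i_xx, i_xy):
    """Hierarchical conditional entropy of the first hierarchy given the second."""
    return i_xx - i_xy


def variation_of_information(i_xx, i_yy, i_xy):
    """Hierarchical Variation of Information as a sum of conditional entropies."""
    return conditional_entropy(i_xx, i_xy) + conditional_entropy(i_yy, i_xy)


def triangle_gap(v_xy, v_yz, v_xz):
    """Assumed form of the difference: negative values violate the triangle inequality."""
    return v_xy + v_yz - v_xz


# Hypothetical HMI values, for illustration only.
print(variation_of_information(1.0, 1.0, 0.25))  # 1.5
print(triangle_gap(1.5, 1.5, 2.0))               # 1.0
```

An exhaustive check over all triples, as in Fig. 2, would evaluate `triangle_gap` for every triple of hierarchical-partitions of a given size.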

Although the HVI fails to satisfy the triangular inequality, the transformation


of does it (see App. D for a detailed proof). In other words, is a distance metric, so the geometrization of the set of hierarchical-partitions is possible. We confirm this in Fig. 3 by running computations analogous to those of Fig. 2 but for instead of . Notice however that the distance metric is non-universal, because it depends on . In fact, for it holds which is a trivial distance metric—known as the discrete metric—that can only distinguish between equality and non-equality. These properties follow because, for fixed-size , the non-zero ’s are bounded from below by a finite positive quantity that tends to zero when . We also remark that other concave growing functions besides that of Eq. 15 (or more specifically Eq. 25) can be used to obtain essentially the same result; a distance metric.

Although the flat VI is a distance metric, which is a desirable property for the quantitative comparison of entities, it also presents some limitations Fortunato and Hric (2016). Hence, besides the HVI, the HMI, and the NHMI, it is convenient to consider other information-theoretic alternatives for the comparison of hierarchies. This is the case of the Adjusted Mutual Information (AMI) Vinh et al. (2009), which is devised to compensate for the biases that random coincidences produce on the NMI, and which we generalize to the hierarchical case by following the original definition recipe

$$A(\mathcal{T};\mathcal{T}') := \frac{I(\mathcal{T};\mathcal{T}') - E(\mathcal{T};\mathcal{T}')}{M\big(H(\mathcal{T}),\, H(\mathcal{T}')\big) - E(\mathcal{T};\mathcal{T}')}, \tag{16}$$

where $M$ is a generalized mean and $H(\mathcal{T}) = I(\mathcal{T};\mathcal{T})$ is the hierarchical entropy. We call this generalization the Adjusted HMI (AHMI). The definition of the AHMI requires the definition of a hierarchical version (EHMI)

$$E(\mathcal{T};\mathcal{T}') := \sum_{\mathcal{R},\mathcal{R}'} P(\mathcal{R},\mathcal{R}'|\mathcal{T},\mathcal{T}')\, I(\mathcal{R};\mathcal{R}') \tag{17}$$

of the Expected Mutual Information (EMI) Vinh et al. (2009). Here, the distribution $P(\mathcal{R},\mathcal{R}'|\mathcal{T},\mathcal{T}')$ represents a reference null model for the randomization of a pair of hierarchical-partitions. As in the original flat case Vinh et al. (2009), we define the distribution in terms of the well-known permutation model. It is important to remark, however, that other alternatives for the flat case have been recently proposed Newman et al. (2019).

Figure 2: (Color online) Complementary cumulative distribution of inequalities for the Hierarchical Variation of Information for different hierarchy sizes . Negative values exist, breaking triangular inequality, although most of them are positive and over a wider range.
Figure 3: (Color online) Complementary cumulative distribution of inequalities for the distance metric derived from the Hierarchical Variation of Information for different hierarchy sizes . All values are non-negative in agreement with the theory.

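To describe the permutation model, let us first introduce some definitions. A permutation $\sigma$ is a bijection over $U$. We can define $\sigma\mathcal{T} := \{\sigma U_t : t \in T\}$ as the hierarchical-partition of the permuted elements, where $\sigma U_t := \{\sigma(i) : i \in U_t\}$ for all $t \in T$. In this way, $(\sigma\mathcal{T})_\ell$ becomes the partition emerging at depth $\ell$ obtained from the permuted elements.

Now we are ready to define the permutation model for hierarchical-partitions. Consider a pair of permutations $\sigma$ and $\sigma'$ over $U$ acting on corresponding hierarchical-partitions $\mathcal{T}$ and $\mathcal{T}'$. The permutation model is defined as

$$P(\mathcal{R},\mathcal{R}'|\mathcal{T},\mathcal{T}') := \frac{1}{(n!)^2} \sum_{\sigma,\sigma'} \delta_{\mathcal{R},\sigma\mathcal{T}}\, \delta_{\mathcal{R}',\sigma'\mathcal{T}'} \tag{18}$$

for any pair of hierarchical-partitions $\mathcal{R}$ and $\mathcal{R}'$. In this way, the EHMI of Eq. 17 can be written as

$$E(\mathcal{T};\mathcal{T}') = \frac{1}{(n!)^2} \sum_{\sigma,\sigma'} I(\sigma\mathcal{T};\sigma'\mathcal{T}') = \frac{1}{n!} \sum_{\sigma} I(\sigma\mathcal{T};\mathcal{T}'), \tag{19}$$

where the simplification of fixing $\sigma'$ to the identity permutation can be used because the labeling of the elements in $U$ is arbitrary.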

The exact computation of Eq. 19 is expensive, even if the expressions are written in terms of contingency tables and corresponding generalized multivariate two-way hypergeometric distributions. This is because, unlike in the flat case, independence among random variables is compromised. Hence, we approximate the EHMI by sampling permutations

until the relative error of the mean falls below .
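The sampling procedure can be sketched as a small Monte Carlo loop in Python. This is a sketch under stated assumptions rather than the authors' implementation: hierarchical-partitions are assumed to be nested lists of elements, the similarity is passed in as a function (any HMI implementation could be plugged in), and the tolerance `rel_tol`, the minimum and maximum sample counts and the seed are illustrative defaults that do not come from the text. Only one of the two hierarchies is permuted, since the labeling of the elements is arbitrary.

```python
import random
from math import sqrt


def relabel(node, mapping):
    """Apply an element permutation to a nested-list hierarchical-partition."""
    if isinstance(node, list):
        return [relabel(child, mapping) for child in node]
    return mapping[node]


def expected_similarity(similarity, a, b, universe, rel_tol=0.01,
                        min_samples=30, max_samples=100_000, seed=0):
    """Estimate the expected similarity under random relabelings of `a`.

    Stops once the standard error of the mean drops below rel_tol * |mean|.
    Returns the estimate and the number of sampled permutations.
    """
    rng = random.Random(seed)
    universe = list(universe)
    count, mean, m2 = 0, 0.0, 0.0  # Welford running mean and squared deviations
    while count < max_samples:
        shuffled = universe[:]
        rng.shuffle(shuffled)
        value = similarity(relabel(a, dict(zip(universe, shuffled))), b)
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)
        if count >= min_samples:
            standard_error = sqrt(m2 / (count - 1) / count)
            if standard_error <= rel_tol * abs(mean):
                break
    return mean, count


def adjusted(observed, expected, upper):
    """Chance-corrected similarity following the AMI recipe."""
    if upper == expected:
        return 0.0  # assumption: convention for the degenerate case
    return (observed - expected) / (upper - expected)
```

A call would look like `expected_similarity(hmi, tree_a, tree_b, range(1, n + 1))`, where `hmi` is whatever HMI implementation is available. Note that the stopping rule can terminate early if the first `min_samples` values are all identical, so `min_samples` should not be set too low.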

Figure 4: (Color online) How similarity by chance affects the Hierarchical Mutual Information . In cyan crosses, values of averaged by sampling pairs of randomly generated hierarchical-partitions and of the universe with elements. In solid magenta circles, the average hierarchical entropy over the sampled s. In open black circles, the Expected Hierarchical Mutual Information (EHMI) averaged over the same pairs of partitions. In solid green squares, the ratio between the first and the second curves. In open blue squares, the EHMI for averaged over . Each point is averaged by sampling 1000 pairs of randomly generated hierarchical-partitions. The EHMI is computed by sampling permutations

until the relative standard error of the mean falls below

Figure 5: (Color online) Average Hierarchical Mutual Information or HMI (solid) and Adjusted Hierarchical Mutual Information or AHMI (dotted) between randomly generated hierarchical-partitions and corresponding hierarchical-partitions obtained from by randomly shuffling the identity of of the elements in . Different symbols represent hierarchical-partitions of different sizes . Each point is averaged over 10,000 samples of . The EHMI within the AHMI is computed as in Fig. 4.

In Fig. 4 we show results concerning how similarities occurring by chance result in non-negligible values of the EHMI for randomly generated hierarchical-partitions. The cyan curve of crosses depicts the average of the HMI between pairs of randomly generated hierarchical-partitions of elements. In App. E we describe the algorithm we use to randomly sample hierarchical-partitions of elements. The previous curve overlaps with the black one of open circles corresponding to the average of the EHMI between the same pairs of randomly generated hierarchical-partitions. This result indicates that the permutation model is a good null model for the comparison of pairs of hierarchical-partitions without correlations. Moreover, these curves exhibit significant positive values, indicating that the HMI detects similarities occurring just by chance between the randomly generated hierarchical-partitions. To determine how significant these values are, the curve of magenta solid circles corresponds to the average of the hierarchical entropies of the generated hierarchical-partitions. As can be seen, the averaged hierarchical entropy lies significantly above the curve of the EHMI. On the other hand, their ratio, which is a quantity in , is over the whole range of studied sizes, as indicated by the green curve of solid squares. In other words, the similarities by chance affect non-negligibly the HMI. The curve of open blue squares depicts the averaged EHMI but for . The curve lies above but follows closely that of the EHMI between different hierarchical-partitions. This indicates that the effect of a randomized structure has a marginal impact besides that of the randomization of labels.

In Fig. 5 we show how the HMI between two hierarchical-partitions and decays with , when is obtained from shuffling the identity of of the elements in . Here, the HMI is averaged by sampling randomly generated hierarchical-partitions at each and . As expected, the EHMI decays as the imposed decorrelation increases. In fact, for but , the obtained values match those of the blue curve of open squares in Fig. 4. In the figure, we also show the AHMI as a function of for the different . Notice how, at difference with the HMI, the AHMI goes from at towards at .

The previous results highlight the importance of the AHMI, in the sense that it provides a less biased measure of similarity than the HMI.

IV Discussion

As we have shown, many similarities persist between the corresponding flat and hierarchical information-theoretic quantities. Still, we remark that important differences also exist. For instance, according to Eq. 10, there is no unique hierarchical-partition maximizing the hierarchical entropy. This is because, in the hierarchical version of the entropy, only the standard partition defined at the last level contributes; the internal levels have no effect. This result has important consequences. For example, a hierarchical generalization of the MaxEnt principle Cover and Thomas (2006) becomes ill-defined. This issue could be resolved by a slightly different reformulation of the principle. Namely, in the flat case, the standard MaxEnt can be replaced by the maximization of the MI between the distribution being maximized and the uniform distribution, or any other reference distribution that can be chosen depending on the purpose. This alternative reinterpretation of MaxEnt admits a hierarchical generalization through the HMI. Since the standard MaxEnt is broadly applied in physics, our work has the potential to stimulate analogous contributions for the hierarchical case.

Another important difference between the standard and the hierarchical cases is that, while the VI satisfies the triangular inequality, the hierarchical version HVI presented here does not. This may have important consequences for the geometrization of an information-theory for hierarchies. On the other hand, we remind the reader that a transformation of the HVI is found to satisfy the triangular inequality, so the geometrization of a hierarchical information-theory is still possible, although not in a universal way because the transformation is size-dependent.

V Conclusions

In this work, we significantly push forward the formalization of an information-theory for hierarchical-partitions which we have previously introduced Perotti et al. (2015). Specifically, we have shown analytically that the Hierarchical Mutual Information (HMI) generalizes a well-known inequality of the traditional flat case. Then, we used this result to prove that the HMI admits an appropriate normalization like its flat counterpart, complementing our previous numerically supported conjecture about it. Later, we showed how to use the HMI to derive other information-theoretic quantities, such as the Hierarchical Entropy, the Hierarchical Conditional Entropy, the Hierarchical Variation of Information (HVI) and the Adjusted Hierarchical Mutual Information (AHMI). Finally, we studied the metric properties of the HVI, finding counter-examples violating the triangular inequality, and thus showing that the HVI fails to have the metric property of its flat analogue. On the other hand, we have found a transformation of the HVI that satisfies the metric properties, enabling a geometrization of the presented hierarchical generalization of the traditional information-theory.

Additionally, we have supported the analytical findings with corresponding numerical experiments. We offer open-source access to our code Perotti (2020), including the code for the generation of hierarchical-partitions.

Our work opens new possibilities in the study of hierarchically organized physical systems, from the information-theoretic side, the statistical side, as well as from the applications point of view. For instance, it would be interesting to see if the HMI can be used to compute consensus trees out of a given ensemble; a well-known problem within the study of phylogenetic and taxonomic representations in computational biology Holland et al. (2004); Miralles and Vences (2013); Salichos et al. (2014).

VI Acknowledgments

JIP and NA acknowledge financial support from grants CONICET (PIP 112 20150 10028), FonCyT (PICT-2017-0973), SeCyT–UNC (Argentina) and MinCyT Córdoba (PID PGC 2018). FS acknowledges support from the European Project SoBigData++ GA. 871042 and the PAI (Progetto di Attività Integrata) project funded by the IMT School Of Advanced Studies Lucca. The authors thank CCAD – Universidad Nacional de Córdoba, which is part of SNCAD – MinCyT, República Argentina, for the provided computational resources.


Appendix A Rewriting the HMI

It is convenient to begin rewriting the hierarchical mutual information in the following alternative form, which is more convenient for our purposes

Here, we have used the definition . Similarly

where we have used that because whenever is not a child of . The entropies in the last two lines are written in terms of the standard non-hierarchical or flat definition, for which


Finally, combining Eqs. A and A we arrive at

Appendix B HMI inequality

The inequality property for the HMI can be straightforwardly proven. Starting from Eq. A, we can write

Here, in the first inequality, we have used a well-known property of the entropy, while in the second inequality, we have used the log-sum inequality Cover and Thomas (2006).

Appendix C Generating hierarchical-partitions

Before showing how to generate all hierarchical-partitions of a set, it is useful to first review a way to generate all standard partitions (see the corresponding section of Knuth (2011)). Suppose we have a way to generate all partitions of the set $\{1, \dots, n-1\}$. Then, we can easily generate all the partitions of the set $\{1, \dots, n\}$ as follows. For each partition of the set $\{1, \dots, n-1\}$, generate all the partitions that can be obtained by adding the element $n$ to each part in turn, together with the one obtained by extending the partition with the new part $\{n\}$. For example, given the partition $\{\{1\},\{2\}\}$ of $\{1,2\}$, we generate the partitions $\{\{1,3\},\{2\}\}$, $\{\{1\},\{2,3\}\}$ and $\{\{1\},\{2\},\{3\}\}$ of $\{1,2,3\}$. In other words, this algorithm recursively implements induction.
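The recursive construction for standard partitions can be written as a short Python generator. This is a minimal sketch of the procedure described above, not the authors' code; partitions are represented as lists of lists and the function name is hypothetical.

```python
def set_partitions(n):
    """Yield every partition of {1, ..., n} as a list of parts (lists)."""
    if n == 0:
        yield []  # the empty set has exactly one partition, the empty one
        return
    for partition in set_partitions(n - 1):
        # Add the element n to each existing part in turn.
        for i in range(len(partition)):
            yield partition[:i] + [partition[i] + [n]] + partition[i + 1:]
        # Or extend the partition with the new part {n}.
        yield partition + [[n]]


for partition in set_partitions(3):
    print(partition)
```

The number of partitions produced for a universe of size 1, 2, 3 and 4 is 1, 2, 5 and 15, respectively (the Bell numbers), which gives a quick sanity check of the enumeration.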

To generate hierarchical-partitions, we follow a similar procedure to the one discussed for standard partitions. Consider we have an algorithm to generate all hierarchical-partitions of . Then, for each hierarchical-partition of , we generate the hierarchical-partitions of that can be obtained by applying the following operations to each of the nodes :

  1. If is a leaf, add to .

  2. If is not a leaf, add the child to with .

  3. Replace by a new node with and as children.

For example, the hierarchical-partitions of are and . Then: Operation 1 applied to the first hierarchical-partition results in . Operation 1 applied to the second results in and . Operation 2 on the second, results in the hierarchical-partitions . Operation 3 on the first, results in . Operation 3 on the second, results in , and . For more details, please check our code for an implementation of the algorithm Perotti (2020).

Appendix D Forcing triangular inequality for the Hierarchical Variation of Information



be defined for some arbitrary . Then, for an appropriate choice of , becomes a distance metric satisfying the triangular inequality. The proof is as follows. First, is clearly a distance since: i) is a growing function of , ii) when and iii) is symmetric in its arguments. It remains to be shown that satisfies the triangular inequality for an appropriate choice of . The triangular inequality for reads

We can show that, for an appropriate choice of , last line is always non-negative given that non-zero values of cannot be arbitrarily small. Thus, let us find a lower bound for the non-zero values of the Variation of Information between hierarchical-partitions. To do so, first, we notice that the Variation of Information between hierarchical-partitions can be decomposed into a summation of non-negative quantities over the different levels. Namely, following Eqs. III.1III.2 and III.2, we can write

with for every due to Eq. 8. Now, if the hierarchical-partitions and are equal up to level included—i.e., as stochastic variables, for all —then