I Introduction
The decomposition of a system into its components and subcomponents is the essence of reductionism. But the reductionist approach is not easily applicable to complex systems, where different emergent patterns are often observed at several scales and organization levels Simon (1962); Barabási (2007). Nevertheless, the convenience of having a hierarchical partitioning of the system representation is manifested in the various frameworks devised to hierarchically decompose the structure Robinson and Foulds (1981); Ravasz and Barabási (2003); Rosvall and Bergstrom (2011); Queyroi et al. (2013); Peixoto (2014); Tibély et al. (2016); Grauwin et al. (2017) and the behavior Staroswiecki et al. (1983); Crutchfield (1994); Rosvall et al. (2014) of complex systems. In physics, these frameworks include the renormalization group theory of critical phenomena Zinn-Justin (2007) and cluster expansion methods Tanaka (2002). These methods are traditionally used with homogeneous hierarchies, but they have also been applied to heterogeneous cases where finding an appropriate decomposition is difficult Guimerà et al. (2003); Song et al. (2005); Zhou et al. (2006); García-Pérez et al. (2018) and where suitable methods for the comparison of hierarchical partitions are helpful Perotti et al. (2015); Kheirkhahzadeh et al. (2016); Gates et al. (2019).
The development of methods for comparing hierarchies is generally nontrivial. Different proposals already exist, including tree-edit distance methods Waterman and Smith (1978); Lu (1979); Gordon (1987); Bille (2005); Zhang et al. (2009); Queyroi and Kirgizov (2015), ad hoc methods Sokal and Rohlf (1962); Fowlkes and Mallows (1983); Gates et al. (2019), and even an information-theoretic method Tibèly et al. (2014). The ad hoc methods can be useful in applications but often lack a well-studied theoretical background. Methods based on tree-edit distances are typically algorithmic in nature, so they frequently rely on suboptimal approximations or only work with fully labeled trees. Similarly, the existing information-theoretic methods cannot handle hierarchical partitions either.
To fulfill the requirements of a comparison method for hierarchies with a well-defined theoretical background, we recently introduced the Hierarchical Mutual Information (HMI) Perotti et al. (2015). The HMI can be used to compare hierarchical partitions in a way analogous to how the standard Mutual Information Cover and Thomas (2006) is used to compare flat partitions Danon et al. (2005). The HMI has proven to be a useful tool for the comparison of detected hierarchical community structures and related problems Yang et al. (2017); Kheirkhahzadeh et al. (2016). Still, many of its theoretical properties remain to be understood.
In this work, we study the theoretical aspects of the HMI and how to exploit them to develop an information theory for hierarchical partitions. Specifically, in Sec. II we introduce some preliminary definitions and revisit the HMI. In Sec. III we present the main results. Firstly, we prove some fundamental properties of the HMI. Secondly, we use the HMI to introduce other information-theoretic quantities for hierarchical partitions, emphasizing the study of the metric properties of the hierarchical generalization of the Variation of Information (VI) and the statistical properties of the hierarchical extension of the Adjusted Mutual Information (AMI). In Sec. IV we discuss some important consequences deriving from the presented results. Finally, in Sec. V we provide a summary of the contributions and corresponding opportunities for future work.
II Theory
II.1 Preliminary definitions
Let $T$ denote a directed rooted tree. We say that $t \in T$ when $t$ is a node of $T$. Let $c(t)$ be the set of children of node $t$. If $c(t) = \emptyset$ then $t$ is a leaf of $T$. Otherwise, it is an internal node of $T$. Let $d(t)$ denote the depth, or topological distance between $t$ and the root of $T$. In particular, $d(t) = 0$ if $t$ is the root. Let $T_d := \{t \in T : d(t) = d\}$ be the set of all nodes of $T$ at depth $d$. Clearly, $T_0$ contains only the root. Let $T_t$ be the subtree obtained from $t$ and its descendants in $T$.
A hierarchical partition of the universe $U = \{1,\dots,N\}$, the set of the first $N$ natural numbers, is defined in terms of a rooted tree $T$ and corresponding subsets $U_t \subseteq U$ satisfying

$$U_t = \bigcup_{s \in c(t)} U_s$$

for all non-leaf $t \in T$, and

$$U_s \cap U_{s'} = \emptyset$$

for every pair of different $s, s' \in c(t)$.
For every non-leaf $t$, the set $\{U_s\}_{s \in c(t)}$ represents a partition of $U_t$, and $\{U_t\}_{t \in T_d}$ is the ordinary partition of $U$ determined by $T$ at depth $d$. Furthermore, the tree $T$ together with the subsets $\{U_t\}_{t \in T}$ constitutes the hierarchical partition of the universe determined by the tree of root $r$. See Fig. 1 for a schematic representation of a hierarchical partition of the universe.
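For concreteness, the two defining conditions can be checked programmatically. The following sketch is illustrative only (the encoding and the helper name `is_hierarchical_partition` are ours, not part of the paper's reference code Perotti (2020)); it encodes a node as a pair `(U_t, children)` and verifies that the children of every internal node are pairwise disjoint and cover their parent:

```python
def is_hierarchical_partition(node):
    """node = (U_t, children): U_t is a set, children a list of nodes.
    Returns True iff every internal node's children partition its set."""
    u, children = node
    if not children:          # a leaf imposes no further condition
        return True
    union = set()
    for child_u, _ in children:
        if union & child_u:   # children must be pairwise disjoint
            return False
        union |= child_u
    # children must cover the parent, and the condition must hold recursively
    return union == u and all(is_hierarchical_partition(c) for c in children)

# A valid hierarchical partition of U = {1, ..., 4}
T = ({1, 2, 3, 4}, [({1, 2}, [({1}, []), ({2}, [])]),
                    ({3, 4}, [])])
```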
II.2 The Hierarchical Mutual Information
The Hierarchical Mutual Information (HMI) Perotti et al. (2015) between two hierarchical partitions $T$ and $T'$ of the same universe reads

$$I(T;T') := I_{rr'}, \qquad (1)$$

where $r$ and $r'$ are the roots of trees $T$ and $T'$, respectively. Here,
$$I_{tt'} = i_{tt'} + \sum_{s \in c(t)} \sum_{s' \in c(t')} p_{ss'}\, I_{ss'} \qquad (2)$$

is a recursively defined expression for every pair of nodes $t \in T$ and $t' \in T'$ at the same depth $d(t) = d(t')$. The probabilities $p_{ss'} := n_{ss'}/n_{tt'}$ in Eq. 2 are ultimately defined from the overlap sizes $n_{tt'} := |U_t \cap U'_{t'}|$, together with the convention $I_{ss'} := 0$ whenever $s$ and $s'$ are both leaves. The quantity

$$i_{tt'} := H_{tt'} + H'_{tt'} - H^{\times}_{tt'} \qquad (3)$$
represents a mutual information between the standard partitions $\{U_s\}_{s \in c(t)}$ and $\{U'_{s'}\}_{s' \in c(t')}$ restricted to the subset $U_t \cap U'_{t'}$ of the universe, and is defined in terms of the three entropies
$$H_{tt'} = -\sum_{s \in c(t)} \frac{n_{st'}}{n_{tt'}} \ln \frac{n_{st'}}{n_{tt'}}, \qquad (4)$$

$$H'_{tt'} = -\sum_{s' \in c(t')} \frac{n_{ts'}}{n_{tt'}} \ln \frac{n_{ts'}}{n_{tt'}}, \qquad (5)$$

and

$$H^{\times}_{tt'} = -\sum_{s \in c(t)} \sum_{s' \in c(t')} \frac{n_{ss'}}{n_{tt'}} \ln \frac{n_{ss'}}{n_{tt'}}, \qquad (6)$$
where the convention $0 \ln 0 = 0$ is adopted. For details on how to compute these quantities, please check our code Perotti (2020).
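To make the recursion of Eqs. 1–6 concrete, here is a minimal, self-contained sketch in the spirit of our code Perotti (2020), but not identical to it. It assumes a nested representation in which a leaf is a `frozenset` of elements and an internal node is a tuple of children; the helper names (`elems`, `children`, `hmi`) are illustrative:

```python
from math import log

def elems(t):
    """Set of elements U_t covered by node t."""
    if isinstance(t, frozenset):
        return t
    return frozenset().union(*(elems(c) for c in t))

def children(t):
    """Children of t; a leaf is treated as its own single block."""
    return list(t) if isinstance(t, tuple) else [t]

def hmi(t, tp):
    """Recursive Hierarchical Mutual Information (in nats): the local
    term i_tt' plus the probability-weighted HMIs of the child pairs."""
    n = len(elems(t) & elems(tp))
    if n == 0 or (isinstance(t, frozenset) and isinstance(tp, frozenset)):
        return 0.0        # empty overlap, or both nodes are leaves
    total = 0.0
    for s in children(t):
        for sp in children(tp):
            nss = len(elems(s) & elems(sp))
            if nss == 0:
                continue
            ns = len(elems(s) & elems(tp))
            nsp = len(elems(t) & elems(sp))
            # local MI contribution plus recursion into the subtrees
            total += (nss / n) * (log(nss * n / (ns * nsp)) + hmi(s, sp))
    return total
```

For instance, for a two-level partition `T` of four elements into two equal blocks, `hmi(T, T)` returns the flat entropy $\ln 2$ of its deepest level, while the HMI against the trivial one-block partition vanishes.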
III Results
For simplicity, we consider hierarchical partitions $T$ and $T'$ with all leaves at the same depth $D$. The results can easily be generalized to trees with leaves at different depths, at the expense of more complicated notation.
III.1 Properties of the HMI
It is convenient to begin by rewriting the hierarchical mutual information in the following alternative form, which is more convenient for our purposes (see App. A for a detailed derivation):

$$I(T;T') = \sum_{d=0}^{D-1} I(X_{d+1};Y_{d+1} \mid X_d, Y_d),$$

where we use flat—i.e. standard, non-hierarchical—conditional MIs and entropies of the stochastic variables $X_d$ and $Y_d$ for $d = 0,\dots,D$, with $X_d$ ($Y_d$) denoting the part of $T$ ($T'$) at depth $d$ containing a uniformly sampled element of the universe. In this way, we have rewritten the HMI as a level-by-level summation of conditional MIs.
Starting from Eq. III.1, we prove the following property of the HMI (see App. B for a detailed derivation)
$$I(T;T') \le \min\big\{\, I(T;T),\; I(T';T')\,\big\}. \qquad (8)$$

In words, this result states that the HMI between two arbitrary hierarchical partitions $T$ and $T'$ of the same universe is smaller than or equal to the mutual information between $T$ and itself—or, analogously, between $T'$ and itself—mimicking an analogous property that holds for the flat mutual information Cover and Thomas (2006).
Now we exploit the result of Eq. 8 to show that the HMI can be properly normalized. Namely, if $m(a,b)$ is any generalized mean Bullen (2003)—like the arithmetic mean $\frac{a+b}{2}$, the geometric mean $\sqrt{ab}$, the max-mean $\max\{a,b\}$ or the min-mean $\min\{a,b\}$—then the Normalized HMI (NHMI)

$$\mathcal{I}(T,T') := \frac{I(T;T')}{m\big(I(T;T),\, I(T';T')\big)} \qquad (9)$$

satisfies $0 \le \mathcal{I}(T,T') \le 1$. The first inequality follows because $I(T;T') \ge 0$ for any $T$ and $T'$. The second follows from Eq. 8.
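As a numerical illustration, the normalization of Eq. 9 reduces to simple arithmetic once the three HMI values are available. The sketch below (illustrative function names; the generalized mean is passed as a parameter) accepts any of the means listed above:

```python
from math import sqrt

def nhmi(i_xy, i_xx, i_yy, mean=lambda a, b: max(a, b)):
    """Normalized HMI of Eq. 9 from precomputed HMI values.
    `mean` may be any generalized mean: max, min, arithmetic, geometric."""
    m = mean(i_xx, i_yy)
    return i_xy / m if m > 0 else 0.0

arithmetic = lambda a, b: 0.5 * (a + b)
geometric = lambda a, b: sqrt(a * b)
```

Since Eq. 8 guarantees that the HMI never exceeds either self-HMI, the result lies in $[0,1]$ for any of these means.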
III.2 Deriving other information-theoretic quantities
Given the HMI, hierarchical versions of other information-theoretic quantities can be obtained by following the rules of the standard flat case. For example, the hierarchical entropy of a hierarchical partition $T$ can be defined as

$$H(T) := I(T;T) = H(X_D), \qquad (10)$$

where we have used that the self-HMI telescopes into the flat entropy of the deepest-level partition (see Eq. III.1). Similarly, we can write down the hierarchical version of the joint entropy as
$$H(T,T') := H(T) + H(T') - I(T;T') \qquad (11)$$

and the conditional entropy as

$$H(T \mid T') := H(T,T') - H(T') = H(T) - I(T;T'). \qquad (12)$$

Furthermore, we can define a hierarchical version of the Variation of Information (HVI) as

$$V(T,T') := H(T \mid T') + H(T' \mid T) = H(T) + H(T') - 2\, I(T;T'). \qquad (13)$$
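The derived quantities above are simple combinations of HMI values. A small sketch (illustrative names, assuming the definition $H(T) := I(T;T)$ used in the text) makes the identities explicit:

```python
def h_joint(i_xy, h_x, h_y):
    """Hierarchical joint entropy: H(T,T') = H(T) + H(T') - I(T;T')."""
    return h_x + h_y - i_xy

def h_cond(i_xy, h_x):
    """Hierarchical conditional entropy: H(T|T') = H(T) - I(T;T')."""
    return h_x - i_xy

def hvi(i_xy, h_x, h_y):
    """Hierarchical Variation of Information:
    V(T,T') = H(T|T') + H(T'|T) = H(T) + H(T') - 2 I(T;T')."""
    return h_cond(i_xy, h_x) + h_cond(i_xy, h_y)
```

The nonnegativity and symmetry of the HVI, discussed next, are immediate from these identities together with Eq. 8.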
Because of Eq. 8, the properties $V(T,T') \ge 0$, $V(T,T') = V(T',T)$ and $V(T,T) = 0$ follow, generalizing corresponding properties of the flat case. Unfortunately, we have found counterexamples violating the triangle inequality for the HVI, which thus fails to generalize its flat counterpart in this particular sense Vinh et al. (2009). For instance, there exist hierarchical partitions $T$, $T'$ and $T''$ for which $V(T,T') + V(T',T'') - V(T,T'')$ is a negative quantity. It is important to remark, however, that the violation of the triangle inequality is relatively weak: the largest violation we found is small compared with the typical values of the HVI. In fact, as shown in Fig. 2, where the complementary cumulative distribution of the differences
$$\Delta(T,T',T'') := V(T,T') + V(T',T'') - V(T,T'') \qquad (14)$$
is plotted for all triples $T$, $T'$ and $T''$ of hierarchical partitions of the universe—without repeating the symmetric cases obtained by exchanging $T$ and $T''$—and for different sizes $N$, the overall contribution of the negative values is small, not only in magnitude but also in probability. Results for larger values of $N$ are not included, since the number of triples grows quickly with $N$, making their exhaustive computation impractical. See App. C for how to generate all possible hierarchical partitions for a given $N$.
Although the HVI fails to satisfy the triangle inequality, the transformation

$$V_{\alpha}(T,T') := \big[V(T,T')\big]^{\alpha} \qquad (15)$$

of the HVI does satisfy it for an appropriately small exponent $\alpha \in (0,1]$ (see App. D for a detailed proof). In other words, $V_{\alpha}$ is a distance metric, so the geometrization of the set of hierarchical partitions is possible. We confirm this in Fig. 3 by running computations analogous to those of Fig. 2, but for $V_{\alpha}$ instead of $V$. Notice, however, that the distance metric $V_{\alpha}$ is non-universal, because the required $\alpha$ depends on $N$. In fact, for $\alpha \to 0$ it holds that $V_{\alpha}(T,T') \to 1 - \delta_{TT'}$, which is a trivial distance metric—known as the discrete metric—that can only distinguish between equality and non-equality. These properties follow because, for fixed size $N$, the nonzero $V$'s are bounded from below by a finite positive quantity that tends to zero when $N \to \infty$. We also remark that other concave growing functions besides that of Eq. 15 (or, more specifically, Eq. 25) can be used to obtain essentially the same result: a distance metric.
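The mechanism can be illustrated numerically. Assuming a concave power-law transformation $v \mapsto v^{\alpha}$ (an illustrative choice, consistent with the discrete-metric limit discussed above), a triple of raw distances violating the triangle inequality can satisfy it after transformation with a small enough $\alpha$; the numbers below are illustrative, not an actual HVI counterexample from the paper:

```python
def transformed(v, alpha):
    """Concave, growing transformation of a raw HVI value."""
    return v ** alpha

def triangle_holds(v_ab, v_bc, v_ac, alpha=1.0):
    """Check f(V(a,b)) + f(V(b,c)) >= f(V(a,c)) for f(v) = v**alpha."""
    return transformed(v_ab, alpha) + transformed(v_bc, alpha) \
        >= transformed(v_ac, alpha)

# Raw values violating the triangle inequality (illustrative numbers):
v_ab, v_bc, v_ac = 0.10, 0.10, 0.25
```

Here `triangle_holds(0.10, 0.10, 0.25)` fails for $\alpha = 1$ but holds for $\alpha = 0.5$, and trivially holds as $\alpha \to 0$, where all nonzero distances approach 1.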
Although the flat VI is a distance metric—which is a desirable property for the quantitative comparison of entities—it also presents some limitations Fortunato and Hric (2016). Hence, besides the HVI, the HMI, and the NHMI, it is convenient to consider other information-theoretic alternatives for the comparison of hierarchies. This is the case of the Adjusted Mutual Information (AMI) Vinh et al. (2009), which is devised to compensate for the biases that random coincidences produce in the NMI, and which we generalize to the hierarchical case by following the original definition recipe
$$\mathrm{AHMI}(T,T') := \frac{I(T;T') - \big\langle I(T;T')\big\rangle}{m\big(I(T;T),\, I(T';T')\big) - \big\langle I(T;T')\big\rangle}. \qquad (16)$$

We call this generalization the Adjusted HMI (AHMI). The definition of the AHMI requires a hierarchical version (EHMI)

$$\big\langle I(T;T')\big\rangle := \sum_{\tilde{T},\tilde{T}'} P(\tilde{T},\tilde{T}')\, I(\tilde{T};\tilde{T}') \qquad (17)$$
of the Expected Mutual Information (EMI) Vinh et al. (2009). Here, the distribution $P(\tilde{T},\tilde{T}')$ represents a reference null model for the randomization of a pair of hierarchical partitions. As in the original flat case Vinh et al. (2009), we define the distribution in terms of the well-known permutation model. It is important to remark, however, that other alternatives for the flat case have been recently proposed Newman et al. (2019).
To describe the permutation model, let us first introduce some definitions. A permutation $\sigma$ is a bijection over $U$. We define $\sigma T$ as the hierarchical partition of the permuted elements, where $U^{\sigma T}_t = \sigma(U_t) := \{\sigma(i) : i \in U_t\}$ for all $t \in T$. In this way, $\{\sigma(U_t)\}_{t \in T_d}$ becomes the partition at depth $d$ obtained from the permuted elements.
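The action of a permutation on the nested leaf-`frozenset`/internal-tuple representation used in our earlier sketches can be written as follows (illustrative helper names; not our reference code Perotti (2020)):

```python
import random

def permute(t, sigma):
    """Apply a permutation sigma (a dict i -> sigma(i)) to every element
    of a nested hierarchical partition, yielding sigma T."""
    if isinstance(t, frozenset):
        return frozenset(sigma[i] for i in t)
    return tuple(permute(child, sigma) for child in t)

def random_permutation(universe, rng=random):
    """Uniformly random bijection over the universe."""
    src = sorted(universe)
    dst = src[:]
    rng.shuffle(dst)
    return dict(zip(src, dst))
```

Note that a permutation relabels the elements but preserves the tree topology and all block sizes, which is exactly the randomization underlying the permutation model.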
Now we are ready to define the permutation model for hierarchical partitions. Consider a pair of permutations $\sigma$ and $\sigma'$ over $U$, acting on corresponding hierarchical partitions $T$ and $T'$. The permutation model is defined as

$$P(\tilde{T},\tilde{T}') := \frac{1}{(N!)^2} \sum_{\sigma,\sigma'} \delta_{\tilde{T},\,\sigma T}\; \delta_{\tilde{T}',\,\sigma' T'}. \qquad (18)$$

In this way, Eq. 17 can be written as

$$\big\langle I(T;T')\big\rangle = \frac{1}{N!} \sum_{\sigma} I(T;\sigma T'),$$

where the simplification to a single sum over permutations can be used because the labeling of the elements in $U$ is arbitrary.
The exact computation of Eq. III.2 is expensive, even if the expressions are written in terms of contingency tables and corresponding generalized multivariate two-way hypergeometric distributions. This is because, at variance with the flat case, independence among random variables is compromised. Hence, we approximate the EHMI by sampling permutations until the relative error of the mean falls below a prescribed tolerance.

In Fig. 4 we show how similarities occurring by chance result in non-negligible values of the EHMI for randomly generated hierarchical partitions. The cyan curve of crosses depicts the average of the HMI between pairs of randomly generated hierarchical partitions of $N$ elements. In App. E we describe the algorithm we use to randomly sample hierarchical partitions of $N$ elements. The previous curve overlaps with the black one of open circles, corresponding to the average of the EHMI between the same pairs of randomly generated hierarchical partitions. This result indicates that the permutation model is a good null model for the comparison of pairs of uncorrelated hierarchical partitions. Moreover, these curves exhibit significant positive values, indicating that the HMI detects similarities occurring just by chance between the randomly generated hierarchical partitions. To determine how significant these values are, the curve of magenta solid circles shows the average of the hierarchical entropies of the generated hierarchical partitions. As can be seen, the averaged hierarchical entropy lies significantly above the curve of the EHMI. On the other hand, their ratio, which is a quantity in $[0,1]$, remains far from zero over the whole range of studied sizes, as indicated by the green curve of solid squares. In other words, similarities by chance affect the HMI non-negligibly. The curve of open blue squares depicts the averaged EHMI, but for $T' = T$. This curve lies above, but follows closely, that of the EHMI between different hierarchical partitions. This indicates that the effect of a randomized structure has a marginal impact besides that of the randomization of labels.
In Fig. 5 we show how the HMI between two hierarchical partitions $T$ and $T'$ decays when $T'$ is obtained from $T$ by shuffling the identity of a fraction $q$ of the elements in $U$. Here, the HMI is averaged over randomly generated hierarchical partitions at each $N$ and $q$. As expected, the averaged HMI decays as the imposed decorrelation increases. In fact, for $q = 1$, the obtained values match those of the blue curve of open squares in Fig. 4. In the figure, we also show the AHMI as a function of $q$ for the different $N$. Notice how, in contrast to the HMI, the AHMI goes from $\approx 1$ at $q = 0$ towards $\approx 0$ at $q = 1$.
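The partial-shuffling protocol just described can be sketched as follows (an illustrative implementation; a fraction $q$ of positions is selected at random and their contents permuted among themselves):

```python
import random

def partial_shuffle(labels, q, rng=random):
    """Return a copy of `labels` with a fraction q of its entries shuffled
    among themselves, the remaining entries left untouched."""
    labels = list(labels)
    k = int(round(q * len(labels)))
    positions = rng.sample(range(len(labels)), k)
    values = [labels[i] for i in positions]
    rng.shuffle(values)
    for i, v in zip(positions, values):
        labels[i] = v
    return labels
```

At $q = 0$ the sequence is returned intact, while at $q = 1$ it is fully shuffled; in both cases the multiset of labels is preserved.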
The previous results highlight the importance of the AHMI, in the sense that it serves as a less biased measure of similarity compared to the HMI.
IV Discussion
As we have shown, many similarities subsist between the corresponding flat and hierarchical information-theoretic quantities. Still, we remark that important differences also exist. For instance, according to Eq. 10, there is no unique hierarchical partition maximizing the hierarchical entropy. This is because, in the hierarchical version of the entropy, only the standard partition defined at the last level contributes; the internal levels produce no effect. This result has important consequences. For example, a hierarchical generalization of the MaxEnt principle Cover and Thomas (2006) becomes ill-defined. This issue could be resolved by a slightly different reformulation of the principle. Namely, in the flat case, the standard MaxEnt can be replaced by the maximization of the MI between the distribution being maximized and the uniform distribution, or any other reference distribution chosen depending on the purpose. This alternative reinterpretation of MaxEnt admits a hierarchical generalization through the HMI. Since the standard MaxEnt is broadly applied in physics, our work has the potential to stimulate analogous contributions for the hierarchical case.
Another important difference between the standard and the hierarchical cases is that, while the VI satisfies the triangle inequality, the hierarchical version HVI presented here does not. This may have important consequences for the geometrization of an information theory for hierarchies. On the other hand, we remind the reader that a transformation of the HVI is found to satisfy the triangle inequality, which is why the geometrization of a hierarchical information theory is still possible, although not in a universal way, because the transformation is size-dependent.
V Conclusions
In this work, we significantly push forward the formalization of an information theory for hierarchical partitions, which we previously introduced in Perotti et al. (2015). Specifically, we have shown analytically that the Hierarchical Mutual Information (HMI) generalizes a well-known inequality of the traditional flat case. Then, we used this result to prove that the HMI admits an appropriate normalization like its flat counterpart, complementing our previous numerically supported conjecture. Later, we showed how to use the HMI to derive other information-theoretic quantities, such as the Hierarchical Entropy, the Hierarchical Conditional Entropy, the Hierarchical Variation of Information (HVI) and the Adjusted Hierarchical Mutual Information (AHMI). Finally, we studied the metric properties of the HVI, finding counterexamples violating the triangle inequality, and thus showing that the HVI fails to inherit the metric property of its flat analogue. On the other hand, we have found a transformation of the HVI that satisfies the metric properties, enabling a geometrization of the presented hierarchical generalization of the traditional information theory.
Additionally, we have supported the analytical findings with corresponding numerical experiments. We offer open-source access to our code Perotti (2020), including the code for the generation of hierarchical partitions.

Our work opens new possibilities in the study of hierarchically organized physical systems, from the information-theoretic side, the statistical side, as well as from the applications point of view. For instance, it would be interesting to see if the HMI can be used to compute consensus trees out of a given ensemble; a well-known problem within the study of phylogenetic and taxonomic representations in computational biology Holland et al. (2004); Miralles and Vences (2013); Salichos et al. (2014).
VI Acknowledgments
JIP and NA acknowledge financial support from grants CONICET (PIP 112 20150 10028), FonCyT (PICT20170973), SeCyT–UNC (Argentina) and MinCyT Córdoba (PID PGC 2018). FS acknowledges support from the European Project SoBigData++ GA. 871042 and the PAI (Progetto di Attività Integrata) project funded by the IMT School Of Advanced Studies Lucca. The authors thank CCAD – Universidad Nacional de Córdoba, http://ccad.unc.edu.ar/, which is part of SNCAD – MinCyT, República Argentina, for the provided computational resources.
References
 Simon (1962) H. A. Simon, in Proceedings of the American Philosophical Society (1962) pp. 467–482.
 Barabási (2007) A.-L. Barabási, IEEE Control Systems Magazine 27, 33 (2007).
 Robinson and Foulds (1981) D. Robinson and L. Foulds, Mathematical Biosciences 53, 131 (1981).
 Ravasz and Barabási (2003) E. Ravasz and A.-L. Barabási, Phys. Rev. E 67, 026112 (2003).
 Rosvall and Bergstrom (2011) M. Rosvall and C. T. Bergstrom, PLOS ONE 6, 1 (2011).
 Queyroi et al. (2013) F. Queyroi, M. Delest, J.-M. Fédou, and G. Melançon, Data Mining and Knowledge Discovery 28, 1107 (2013).
 Peixoto (2014) T. P. Peixoto, Phys. Rev. X 4, 011047 (2014).
 Tibély et al. (2016) G. Tibély, D. Sousa-Rodrigues, P. Pollner, and G. Palla, PLOS ONE 11, 1 (2016).
 Grauwin et al. (2017) S. Grauwin, M. Szell, S. Sobolevsky, P. Hövel, F. Simini, M. Vanhoof, Z. Smoreda, A.-L. Barabási, and C. Ratti, Scientific Reports 7 (2017).
 Staroswiecki et al. (1983) M. Staroswiecki, V. Toro, and M. Sbai, IFAC Proceedings Volumes 16, 213 (1983), IFAC/IFORS Symposium on Large Scale Systems: Theory and Applications 1983, Warsaw, Poland, 11-15 July 1983.
 Crutchfield (1994) J. P. Crutchfield, Physica D: Nonlinear Phenomena 75, 11 (1994).
 Rosvall et al. (2014) M. Rosvall, A. V. Esquivel, A. Lancichinetti, J. D. West, and R. Lambiotte, Nature Communications 5, 4630 (2014).
 Zinn-Justin (2007) J. Zinn-Justin, Phase Transitions and Renormalisation Group (Oxford Graduate Texts) (Oxford University Press, 2007).
 Tanaka (2002) T. Tanaka, Methods of statistical physics (Cambridge University Press, 2002).
 Guimerà et al. (2003) R. Guimerà, L. Danon, A. Díaz-Guilera, F. Giralt, and A. Arenas, Phys. Rev. E 68, 065103 (2003).
 Song et al. (2005) C. Song, S. Havlin, and H. A. Makse, Nature 433, 392 (2005).
 Zhou et al. (2006) C. Zhou, L. Zemanová, G. Zamora, C. C. Hilgetag, and J. Kurths, Phys. Rev. Lett. 97, 238103 (2006).
 García-Pérez et al. (2018) G. García-Pérez, M. Boguñá, and M. Á. Serrano, Nature Physics 14, 583 (2018).
 Perotti et al. (2015) J. I. Perotti, C. J. Tessone, and G. Caldarelli, Physical Review E 92, 062825 (2015).
 Kheirkhahzadeh et al. (2016) M. Kheirkhahzadeh, A. Lancichinetti, and M. Rosvall, Phys. Rev. E 93, 032309 (2016).
 Gates et al. (2019) A. J. Gates, I. B. Wood, W. P. Hetrick, and Y.-Y. Ahn, Scientific Reports 9 (2019), 10.1038/s41598-019-44892-y.
 Waterman and Smith (1978) M. Waterman and T. Smith, Journal of Theoretical Biology 73, 789 (1978).
 Lu (1979) S. Lu, IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-1, 219 (1979).
 Gordon (1987) A. D. Gordon, Journal of the Royal Statistical Society: Series A (General) 150, 119 (1987).
 Bille (2005) P. Bille, Theoretical Computer Science 337, 217 (2005).
 Zhang et al. (2009) Q. Zhang, E. Y. Liu, A. Sarkar, and W. Wang, in Scientific and Statistical Database Management, edited by M. Winslett (Springer Berlin Heidelberg, Berlin, Heidelberg, 2009) pp. 517–534.
 Queyroi and Kirgizov (2015) F. Queyroi and S. Kirgizov, Information Processing Letters 115, 689 (2015).
 Sokal and Rohlf (1962) R. R. Sokal and F. J. Rohlf, TAXON 11, 33 (1962).
 Fowlkes and Mallows (1983) E. B. Fowlkes and C. L. Mallows, Journal of the American statistical association 78, 553 (1983).
 Tibèly et al. (2014) G. Tibèly, P. Pollner, T. Vicsek, and G. Palla, PLOS ONE 8, 1 (2014).
 Cover and Thomas (2006) T. M. Cover and J. A. Thomas, Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing) (Wiley-Interscience, 2006).
 Danon et al. (2005) L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas, Journal of Statistical Mechanics: Theory and Experiment 2005, P09008 (2005).
 Yang et al. (2017) Z. Yang, J. I. Perotti, and C. J. Tessone, Phys. Rev. E 96, 052311 (2017).
 Perotti (2020) J. I. Perotti, https://github.com/jipphysics/hit (2020).
 Bullen (2003) P. S. Bullen, Handbook of Means and Their Inequalities (Springer Netherlands, 2003).

 Vinh et al. (2009) N. X. Vinh, J. Epps, and J. Bailey, in Proceedings of the 26th Annual International Conference on Machine Learning (ACM, 2009) pp. 1073–1080.
 Fortunato and Hric (2016) S. Fortunato and D. Hric, Physics Reports 659, 1 (2016), community detection in networks: A user guide.
 Newman et al. (2019) M. E. J. Newman, G. T. Cantwell, and J.-G. Young, “Improved mutual information measure for classification and community detection,” (2019), arXiv:1907.12581.
 Holland et al. (2004) B. R. Holland, K. T. Huber, V. Moulton, and P. J. Lockhart, Molecular Biology and Evolution 21, 1459 (2004).
 Miralles and Vences (2013) A. Miralles and M. Vences, PLOS ONE 8, 1 (2013).
 Salichos et al. (2014) L. Salichos, A. Stamatakis, and A. Rokas, Molecular Biology and Evolution 31, 1261 (2014).
 Knuth (2011) D. E. Knuth, The Art of Computer Programming, Volume 4A: Combinatorial Algorithms, Part 1, 4th ed. (Addison-Wesley Professional, 2011).
 Meilă (2007) M. Meilă, Journal of Multivariate Analysis 98, 873 (2007).
Appendix A Rewriting the HMI
It is convenient to rewrite the hierarchical mutual information in an alternative form. Unrolling the recursion of Eq. 2 from the roots downwards, each pair of nodes at depth $d$ contributes its local term $i_{tt'}$ weighted by the probability of reaching it, so

$$I(T;T') = \sum_{d=0}^{D-1} \sum_{t \in T_d} \sum_{t' \in T'_d} p_{tt'}\, i_{tt'}. \qquad (19)$$

Here, we have used the definition $p_{tt'} := n_{tt'}/N$. Similarly,

$$\sum_{t \in T_d} \sum_{t' \in T'_d} p_{tt'}\, i_{tt'} = I(X_{d+1};Y_{d+1} \mid X_d, Y_d), \qquad (20)$$

where we have used that $P(X_{d+1} = s \mid X_d = t) = 0$ whenever $s$ is not a child of $t$. The entropies entering the conditional MI are written in terms of the standard non-hierarchical or flat definition, for which

$$I(X;Y \mid Z) = H(X \mid Z) + H(Y \mid Z) - H(X,Y \mid Z). \qquad (22)$$

Finally, combining Eqs. 19 and 20, we arrive at

$$I(T;T') = \sum_{d=0}^{D-1} I(X_{d+1};Y_{d+1} \mid X_d, Y_d). \qquad (23)$$
Appendix B HMI inequality
Appendix C Generating hierarchicalpartitions
Before showing how to generate all hierarchical partitions of a set, it is convenient to first review a way to generate all standard partitions (see Section 7.2.1.7 of Knuth (2011)). Suppose we have a way to generate all partitions of the set $\{1,\dots,n\}$. Then, we can easily generate all the partitions of the set $\{1,\dots,n+1\}$ as follows. For each partition of $\{1,\dots,n\}$, generate all the partitions that can be obtained by adding the element $n+1$ to each part, together with the partition extended by the new part $\{n+1\}$. For example, given the partition $\{\{1\},\{2\}\}$ of $\{1,2\}$, we generate the partitions $\{\{1,3\},\{2\}\}$, $\{\{1\},\{2,3\}\}$ and $\{\{1\},\{2\},\{3\}\}$ of $\{1,2,3\}$. In other words, this algorithm recursively implements induction.
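The induction just described can be coded directly; the following is a short illustrative sketch (the function name `all_partitions` is ours):

```python
def all_partitions(n):
    """All partitions of {1, ..., n}, generated by the induction above:
    each partition of {1, ..., n-1} is extended by inserting n into every
    existing part, or as the new singleton part {n}."""
    if n == 0:
        return [[]]
    result = []
    for p in all_partitions(n - 1):
        for i in range(len(p)):        # insert n into the i-th part
            result.append([part | {n} if j == i else part
                           for j, part in enumerate(p)])
        result.append(p + [{n}])       # or open the new part {n}
    return result
```

The list lengths follow the Bell numbers 1, 1, 2, 5, 15, 52, …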
To generate hierarchical partitions, we follow a procedure similar to the one discussed for standard partitions. Suppose we have an algorithm to generate all hierarchical partitions of $\{1,\dots,n\}$. Then, for each hierarchical partition of $\{1,\dots,n\}$, we generate the hierarchical partitions of $\{1,\dots,n+1\}$ that can be obtained by applying the following operations to each of the nodes $t$:

1. If $t$ is a leaf, add the element $n+1$ to $U_t$ (and, consequently, to the sets of all ancestors of $t$).

2. If $t$ is not a leaf, add a child $s$ to $t$ with $U_s = \{n+1\}$.

3. Replace $t$ by a new node having $t$ and a new leaf $\{n+1\}$ as children.
For example, the hierarchical partitions of $\{1,2\}$ are $\{1,2\}$ and $\{\{1\},\{2\}\}$. Then: Operation 1 applied to the first hierarchical partition results in $\{1,2,3\}$. Operation 1 applied to the second results in $\{\{1,3\},\{2\}\}$ and $\{\{1\},\{2,3\}\}$. Operation 2 on the second results in the hierarchical partition $\{\{1\},\{2\},\{3\}\}$. Operation 3 on the first results in $\{\{1,2\},\{3\}\}$. Operation 3 on the second results in $\{\{\{1\},\{2\}\},\{3\}\}$, $\{\{\{1\},\{3\}\},\{2\}\}$ and $\{\{1\},\{\{2\},\{3\}\}\}$. For more details, please check our code for an implementation of the algorithm Perotti (2020).
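The three operations can be implemented over the nested representation used earlier (leaf = `frozenset`, internal node = tuple of children). This is an illustrative sketch, not our reference implementation Perotti (2020); the helper names `extend` and `all_hpartitions` are ours:

```python
def extend(t, x):
    """All hierarchical partitions obtained by inserting the new element x
    into the hierarchical partition t, via the three operations applied
    at every node of t."""
    results = []
    if isinstance(t, frozenset):
        results.append(t | {x})                    # operation 1: grow a leaf
    else:
        results.append(t + (frozenset({x}),))      # operation 2: new child leaf
        for i, child in enumerate(t):              # apply the operations
            for variant in extend(child, x):       # recursively below t
                results.append(t[:i] + (variant,) + t[i + 1:])
    results.append((t, frozenset({x})))            # operation 3: new root above t
    return results

def all_hpartitions(n):
    """All hierarchical partitions of {1, ..., n} by induction on n."""
    if n == 1:
        return [frozenset({1})]
    return [v for t in all_hpartitions(n - 1) for v in extend(t, n)]
```

For $n = 3$ this reproduces exactly the eight hierarchical partitions enumerated above.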
Appendix D Forcing the triangle inequality for the Hierarchical Variation of Information
Let

$$V_{\alpha}(T,T') := \big[V(T,T')\big]^{\alpha} \qquad (25)$$

be defined for some arbitrary $\alpha \in (0,1]$. Then, for an appropriate choice of $\alpha$, $V_{\alpha}$ becomes a distance metric satisfying the triangle inequality. The proof is as follows. First, $V_{\alpha}$ is clearly a distance since: i) $v^{\alpha}$ is a growing function of $v$, ii) $V_{\alpha}(T,T') = 0$ when $T = T'$, and iii) $V_{\alpha}$ is symmetric in its arguments. It remains to be shown that $V_{\alpha}$ satisfies the triangle inequality for an appropriate choice of $\alpha$. The triangle inequality for $V_{\alpha}$ reads

$$\big[V(T,T')\big]^{\alpha} + \big[V(T',T'')\big]^{\alpha} \ge \big[V(T,T'')\big]^{\alpha}.$$

We can show that, for an appropriate choice of $\alpha$, this always holds, given that the nonzero values of $V$ cannot be arbitrarily small. Thus, let us find a lower bound for the nonzero values of the Variation of Information between hierarchical partitions. To do so, we first notice that the Variation of Information between hierarchical partitions can be decomposed into a summation of nonnegative quantities over the different levels. Namely, following Eq. III.1 and the definitions of Sec. III.2, we can write

$$V(T,T') = \sum_{d=0}^{D-1} v_d, \qquad v_d := H(X_{d+1} \mid X_d) + H(Y_{d+1} \mid Y_d) - 2\, I(X_{d+1};Y_{d+1} \mid X_d, Y_d),$$

with $v_d \ge 0$ for every $d$ due to Eq. 8. Now, if the hierarchical partitions $T$ and $T'$ are equal up to and including level $d$—i.e., as stochastic variables, $X_{d'} = Y_{d'}$ for all $d' \le d$—then the levels $d' < d$ contribute nothing to $V(T,T')$, because for them $H(X_{d'+1} \mid X_{d'}) = H(Y_{d'+1} \mid Y_{d'}) = I(X_{d'+1};Y_{d'+1} \mid X_{d'}, Y_{d'})$, so that $v_{d'} = 0$.