Information Content of a Phylogenetic Tree in a Data Matrix
Phylogenetic trees in genetics, and in biology in general, are all binary. We attempt to answer one fundamental question: is such binary branching, from the coarsest to the finest scales, supported by data? We convert this question into an equivalent one: where is the structural information of a tree in a data matrix? Our results on this conceptual and computational issue lead us to a negative answer: the splitting of each branch into two at every inter-node, from the top to the bottom level of the tree, is a man-made structure. The data-driven computing paradigm Data Mechanics is employed here to reveal that the information of a tree is composed of a set of selected temperatures (or scales), each of which has a clustering composition strictly regulated by a temperature-specific cluster-sharing probability matrix. The resultant Data Cloud Geometry (DCG) tree on the space of species is proposed as the authentic structure contained in the data. In particular, each core cluster on the finest scale, the bottom level of the DCG tree, should not be further partitioned because of its uniformity. Beyond the finest scale, the branching of the DCG tree is primarily based on probability, which induces an ultrametric satisfying the super triangular inequality. This ultrametric property differentiates the DCG tree from all popular trees based on hierarchical clustering (HC) algorithms, which typically employ an empirical, often ad hoc distance measure. Since this measure is regulated by the triangular inequality, it is not capable of producing a "flat" branch, in which all members (more than two) are equidistant from one another. We demonstrate such information content first on an illustrative zoo data set, and then on two genomic data sets.
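The contrast between the two inequalities can be checked numerically. The sketch below assumes that the "super triangular inequality" is the standard ultrametric condition, d(x, z) ≤ max(d(x, y), d(y, z)). The function names and the two toy distance matrices are illustrative assumptions, not taken from the paper, and the code does not implement DCG or Data Mechanics.

```python
from itertools import permutations

def is_ultrametric(dist, tol=1e-12):
    """True if d(x, z) <= max(d(x, y), d(y, z)) holds for every triple."""
    n = len(dist)
    return all(
        dist[i][k] <= max(dist[i][j], dist[j][k]) + tol
        for i, j, k in permutations(range(n), 3)
    )

def satisfies_triangle(dist, tol=1e-12):
    """True if d(x, z) <= d(x, y) + d(y, z) holds for every triple."""
    n = len(dist)
    return all(
        dist[i][k] <= dist[i][j] + dist[j][k] + tol
        for i, j, k in permutations(range(n), 3)
    )

# Toy "flat" branch (assumed example): three members, all pairwise equidistant.
flat = [
    [0.0, 1.0, 1.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

# Toy ordinary metric (assumed example): points at 0, 1, 2 on a line.
line = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]

print("flat is ultrametric:", is_ultrametric(flat))
print("line satisfies triangle inequality:", satisfies_triangle(line))
print("line is ultrametric:", is_ultrametric(line))
```

In this toy setting the equidistant triple passes the ultrametric check, while the collinear points satisfy the ordinary triangular inequality but violate the ultrametric one, since the distance 2 between the end points exceeds the larger of the two intermediate distances.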