Tropical Foundations for Probability & Statistics on Phylogenetic Tree Space

05/31/2018
by   Bo Lin, et al.
Columbia University


Abstract

We introduce a novel framework for the statistical analysis of phylogenetic trees: Palm tree space is constructed on principles of tropical algebraic geometry, and represents phylogenetic trees as points in a space endowed with the tropical metric. We show that palm tree space possesses a variety of properties that allow for the definition of probability measures, and thus expectations, variances, and other fundamental statistical quantities. This provides a new, tropical basis for a statistical treatment of evolutionary biological processes represented by phylogenetic trees. In particular, we show that a geometric approach to phylogenetic tree space (first introduced by Billera, Holmes, and Vogtmann, and reinterpreted in this paper via tropical geometry) results in analytic, geometric, and topological characteristics that are desirable for probability, statistics, and increased computational efficiency.

Keywords:

BHV tree space; phylogenetic trees; tropical line segments; tropical metric; tropical statistics.

1 Introduction

In this paper, we provide the rigorous foundations for probabilistic and statistical formalism on the space of phylogenetic trees, which we reinterpret via a construction based on algebraic geometry. Specifically, we show that the tropical geometric space of phylogenetic trees is endowed with a well-defined metric. We give the specifics of topological and geometric quantities in this space, such as the characterization of open balls and line segments; we also study the behavior of tree combinatorial types within this space. We prove that the tropical geometric space of phylogenetic trees is complete and separable under this metric, and that there exist compact subspaces. These properties allow the existence of probability measures as well as the definition of expectations and variances. These results formalize the theory of mathematical phylogenetics for probability and statistical analysis via tropical geometry.

Phylogenetic trees are the fundamental mathematical representation of evolutionary processes. They model many important and diverse biological phenomena, such as speciation, the spread of pathogens, and the evolution of cancer. The field of mathematical phylogenetics is a well-studied discipline, and the area of quantitative statistics for spaces of phylogenetic trees has been under active research for several decades for two important reasons: First, explicit computations directly on phylogenetic tree spaces are challenging due to extremely high dimensionality when working with hundreds of thousands of trees on species. Second, standard statistical methodology and quantitative algorithms are not directly applicable on phylogenetic tree spaces due to their non-Euclidean nature, with the trees themselves being discrete, geometric structures. Significant previous work exists to address various classical statistical interests, such as the calculation of medians and confidence bands of phylogenetic trees (e.g. Felsenstein (1985); Barthélemy and McMorris (1986)). The geometry of the space of phylogenetic trees, where each tree is represented as a point, was studied extensively by Billera et al. (2001). This perspective became a fundamental breakthrough for quantitative work on phylogenetic trees via the Billera–Holmes–Vogtmann (BHV) tree space. Indeed, subsequent work built on this geometry and gave way to statistical methodology, such as clustering and hypothesis testing (Holmes, 2005; Chakerian and Holmes, 2012), principal component analysis (Nye, 2011), and dimensionality reduction via subsampling (Zairis et al., 2016). Theoretical statistical results formalizing the limiting behavior of Fréchet means also make use of the BHV geometry of tree space (Barden et al., 2018). BHV tree space remains an object of active research interest in mathematical phylogenetics.

Tropical geometry can be viewed as a subdiscipline of algebraic geometry, where arithmetic evaluations are based on a specific algebraic semiring known as the tropical semiring. Speyer and Sturmfels (2004) proved the existence of a homeomorphism between the space of phylogenetic trees with $N$ leaves and a tropical construction of the Grassmannian. This result provides the first formal connection between tropical geometry and mathematical phylogenetics: it endows the space of phylogenetic trees with a tropical structure and thus allows for a tropical coordinatization. However, despite this coincidence between tropical geometry and phylogenetic trees, there has been very little previous work that explores tropical geometry in statistical inference on phylogenetic trees. In existing work involving tropical geometry and statistical inference, Pachter and Sturmfels (2004) formulate the use of tropical geometry in statistics of graphical models as algebraic varieties, while Monod et al. (2017) demonstrate its use in constructing sufficient statistics in topological data analysis. Very recently, Yoshida et al. (2017) introduced a method for principal component analysis on phylogenetic tree space that uses tropical geometry. Principal component analysis is an important statistical technique for dimension reduction and data visualization, and thus for exploratory data analysis; this method suggests the potential for the study of tropical statistics of phylogenetic trees (Lin et al., 2017).

The main contribution of this work is a formal theoretical basis and rigorous foundations to enable statistical analysis on the space of phylogenetic trees via tropical geometry. In particular, we endow the space of phylogenetic trees with the tropical metric, and study in detail this tropical moduli space, which we refer to as palm tree space. We show that palm tree space is a Polish space, i.e. that it is a separable and completely metrizable topological space. The fact that it is Polish allows us to prove the existence of properties and quantities that are fundamental for statistics, namely, probability measures, means and variances. This ensures that statistical questions are well-defined. We also study geometric and topological features of palm tree space, and show that it exhibits properties that are desirable for analytics and computation. Our work is in part motivated by the geometric study of BHV space, but differs in the algebraic geometric flavor lent by tropical geometry, which results in certain properties that are more natural for statistical analysis. In particular, the tropical metric allows for a tropical version of linear algebra on tree space; linear algebra is the foundation of classical statistics. This paper therefore proposes a tropical framework as a novel geometric setting for studying the statistics of phylogenetic evolution in computational biology. More generally, the results we prove in this paper provide a basis for a general study of tropical statistics. Such a study suggests a new, geometric direction within the larger field of algebraic statistics, where techniques from computational algebra are leveraged in solutions to statistical problems.

The remainder of this paper is organized as follows. In Section 2, we provide the mathematical setting for our contributions. Specifically, we define and discuss the relationship between phylogenetic trees and metrics, in the context of tropical geometry. In Section 3, we introduce and provide formal properties of the tropical metric on the space of phylogenetic trees, and formally define palm tree space. We give details on the geometry and topology of palm tree space, and also prove analytic properties that make it Polish, making it amenable to statistical analysis. Section 4 gives a detailed treatment of the special case of equidistant trees and studies the behavior of their tree topologies in the setting of tropical line segments. Section 5 discusses and gives details of the tropical probabilistic quantities that are fundamental in statistical analysis and inference. We close in Section 6 with a discussion, and some ideas for future research.

2 Tropical Geometry & Tree Metrics

In this section, we begin by introducing the essentials of tropical geometry for our study. We then discuss in detail the relationship between phylogenetic trees and metrics, and in particular, how phylogenetic trees may be represented by metrics. We close this section with a brief review of some important and commonly-occurring tree metrics, in particular, the BHV tree metric.

2.1 Fundamentals of Tropical Geometry

Tropical algebra is a branch of abstract algebra based on specific semirings, which we will now define. Tropical geometry extends tropical algebra and is a subdiscipline of algebraic geometry: where algebraic geometry studies geometric properties of the zeros of systems of multivariate polynomials (generally known as algebraic varieties), tropical geometry studies their counterparts under tropical algebra (known as tropical varieties).

Definition 1.

The tropical semiring is $(\mathbb{R} \cup \{+\infty\}, \oplus, \odot)$, with addition and multiplication given by:

$$a \oplus b := \min(a, b), \qquad a \odot b := a + b,$$

where both operations are commutative and associative, and multiplication distributes over addition. Similarly, there also exists the max-plus semiring $(\mathbb{R} \cup \{-\infty\}, \boxplus, \odot)$, where max-plus multiplication of two elements is also defined as usual addition as above, but max-plus addition amounts to taking the maximum instead of the minimum:

$$a \boxplus b := \max(a, b).$$

Note that tropical and max-plus subtraction are not defined in these semirings. Tropicalization or max-plus tropicalization means passing from standard arithmetic to tropical or max-plus arithmetic, by replacing the standard arithmetic operations with their tropical counterparts.

The tropical and max-plus semirings are very closely related, and in some settings, the terms “tropical” and “max-plus” are used interchangeably: the term “tropical” is sometimes used even when the $\max$ convention is adopted for addition, which corresponds to the $\min$ convention with all elements negated.

Under these arithmetic operations, usual mathematical relations and objects of interest may be studied, such as lines, functions, and sets (see Maclagan and Sturmfels (2015) for examples and a complete, formal treatment). Moreover, binary operations on these semirings are linearizing, resulting in piecewise linear constructions of curves and functions. Such operations are desirable in reducing computational complexity. In the setting of this paper, these operations will be used in constructions on the space of phylogenetic trees.
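To make these operations concrete, the following minimal Python sketch implements min-plus and max-plus arithmetic and evaluates a tropical polynomial, which is a piecewise linear function of its argument. The function names (`trop_add`, `trop_mul`, `maxplus_add`, `trop_poly`) and all numerical values are illustrative assumptions, not notation from this paper.

```python
import math

INF = math.inf  # identity element for min-plus addition: min(a, INF) == a


def trop_add(a, b):
    """Tropical (min-plus) addition: the minimum of the two arguments."""
    return min(a, b)


def trop_mul(a, b):
    """Tropical multiplication: ordinary addition."""
    return a + b


def maxplus_add(a, b):
    """Max-plus addition: the maximum of the two arguments."""
    return max(a, b)


def trop_poly(coeffs, x):
    """Evaluate the min-plus polynomial with coefficients c_0, c_1, ...

    Tropically, c_k * x^k becomes c_k + k * x, and the sum becomes a minimum,
    so the result is the piecewise linear function min_k (c_k + k * x).
    """
    return min(c + k * x for k, c in enumerate(coeffs))


# Multiplication distributes over addition: a*(b+c) == (a*b)+(a*c), tropically.
a, b, c = 3.0, 5.0, -2.0
lhs = trop_mul(a, trop_add(b, c))
rhs = trop_add(trop_mul(a, b), trop_mul(a, c))
print(lhs == rhs)

# A tropical quadratic evaluated at three points (hypothetical coefficients).
print([trop_poly([1.0, 0.0, 2.0], x) for x in (-3, 0, 4)])
```

Each evaluation costs only comparisons and additions, which is one way to read the remark that tropical operations are linearizing.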

2.2 Phylogenetic Trees and Metrics

The biological dynamics of, and relationships between, various taxa that evolve simultaneously and are believed to be related via a common ancestor are depicted graphically as a branching diagram known as a phylogenetic tree. The common ancestor is represented as a single node or root, and the evolution is represented by bifurcations, referred to as edges, which end in terminal nodes, known as leaves. From DNA, RNA, or protein multiple sequence alignment (MSA) of a finite number of species as the input, the aim is to reconstruct their evolutionary phylogeny and graphically represent it as a tree. Tree reconstruction techniques largely fall into two classes: statistical methods and distance-based methods. In the former, optimality criteria are specified and achieved, for example maximum parsimony and likelihood (Fitch, 1971; Felsenstein, 1981). Similarly, Bayesian approaches estimate the posterior distributions of trees (Edwards, 1970; Rannala and Yang, 1996). An important motivation behind the development of statistical methods lies in the extremely high number of possibilities for tree combinatorial types or topologies (i.e. the configuration of branch placement, together with a leaf labeling scheme) for a rooted binary tree with $N$ leaves,

$$(2N-3)!! = (2N-3) \times (2N-5) \times \cdots \times 3 \times 1.$$

The time complexity for likelihood phylogenetic inference methods has been studied by Roch and Sly (2017). The problem of finding the “optimal” tree is known to be NP-complete (Schröder, 1870; Foulds and Graham, 1982).
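The number of rooted binary tree topologies on $N$ labeled leaves is the double factorial $(2N-3)!!$, which grows faster than exponentially. A small sketch makes the growth tangible; the function name is hypothetical.

```python
def num_rooted_binary_topologies(n_leaves):
    """Return (2N-3)!! = 1 * 3 * 5 * ... * (2N-3) for N >= 2 labeled leaves."""
    if n_leaves < 2:
        raise ValueError("need at least two leaves")
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2N-3 (empty product when N == 2).
    for odd in range(3, 2 * n_leaves - 2, 2):
        count *= odd
    return count


for n in (3, 4, 5, 10, 20):
    print(n, num_rooted_binary_topologies(n))
```

Already at ten leaves there are more than thirty million topologies, so exhaustive enumeration is only feasible for very small trees.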

The latter approach deals with reconstructing trees from distance matrices (e.g. Sokal and Michener (1958); Fitch and Margoliash (1967); Saitou and Nei (1987)). Distance-based methods for reconstructing phylogenetic trees first and foremost depend on the specification of genetic distance, such as Hamming or Jukes–Cantor distances, which measures distances between all pairs of sequences. Phylogenetic trees are then reconstructed from this matrix by positioning sequences that are closely related under the same node, with branch lengths faithfully corresponding to the observed distances between sequences. These pairwise distances between leaves in a tree can be stored in vectors, effectively representing the trees as vectors. Such vectors computed from various trees can then be used for comparative statistical and computational studies between these trees, which is the focus of this paper.
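As an illustration of the input to distance-based methods, the sketch below computes normalized Hamming distances between aligned sequences, applies the standard Jukes–Cantor correction $d = -\tfrac{3}{4}\ln\left(1 - \tfrac{4}{3}p\right)$, and stores all pairwise values in a vector. The sequences and function names are hypothetical and chosen only for this example.

```python
import math
from itertools import combinations


def hamming_fraction(s, t):
    """Proportion of aligned sites at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s, t)) / len(s)


def jukes_cantor(p):
    """Jukes-Cantor distance from a proportion p of differing sites."""
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_vector(seqs, dist):
    """Pairwise distances in the order (1,2), (1,3), ..., (N-1,N)."""
    return [dist(seqs[i], seqs[j]) for i, j in combinations(range(len(seqs)), 2)]


# Four hypothetical aligned sequences of length 8.
seqs = ["ACGTACGT", "ACGTACGA", "ACGAACGA", "TCGAACGA"]
p_vec = distance_vector(seqs, hamming_fraction)
jc_vec = [jukes_cantor(p) for p in p_vec]
print(p_vec)
print([round(d, 4) for d in jc_vec])
```

For $N = 4$ sequences the vector has $\binom{4}{2} = 6$ entries. Whether such a vector corresponds to a tree depends on the conditions discussed later in this section.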

We now provide the formal setting of phylogenetic trees in the context of distances; further details to supplement what follows may be found in Pachter and Sturmfels (2005).

Notation.

In this paper, we consider $N \in \mathbb{N}$, and denote $[N] := \{1, 2, \ldots, N\}$ and $n := \binom{N}{2}$.

Definition 2.

A phylogenetic tree is an acyclic, connected graph, $T = (V, E)$, where $V$ is the set of labeled, terminal nodes called leaves, among which there is no vertex of degree $2$. The set $E$ consists of its edges (also called branches), each with positive length, representing evolutionary time.

An unrooted phylogenetic tree with $N$ leaves is a tree with labels only on its leaves. A rooted phylogenetic tree on $N$ leaves may be obtained from an unrooted tree by setting the endpoint of the unique edge connecting to the leaf $0$ as the root.

In a phylogenetic tree, an edge connecting to a leaf is called an external edge; otherwise, it is called an internal edge.

Remark 3.

A technicality that we wish to highlight concerns the root: In typical depictions of phylogenetic trees (such as in Figure 6), a root appears to be a vertex of degree $2$. To reconcile such a typical depiction with Definition 2, we can imagine an edge extending from the root to depict the $3$-spider (such as in Figure 3), and then set the leaf at the end of this edge to be $0$. This is what is described above in Definition 2, specifically in the procedure to obtain a rooted phylogenetic tree from an unrooted tree. It is common practice to disregard this edge altogether, hence a vertex of degree $2$ appears to exist at the root, but this is only for matters of computational and illustrative convenience.

Definition 4.

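A dissimilarity map $w$ is a function $w : [N] \times [N] \to \mathbb{R}_{\geq 0}$ such that

$$w(i,i) = 0 \quad \text{and} \quad w(i,j) = w(j,i) \geq 0$$

for every $i, j \in [N]$. If a dissimilarity map $w$ additionally satisfies the triangle inequality, $w(i,j) \leq w(i,k) + w(k,j)$ for all $i, j, k \in [N]$, then $w$ is called a metric.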

Notation.

In this paper, for notational convenience, we interchangeably write $w(i,j) = w_{ij}$.

The relationship between dissimilarity maps and metrics is intrinsically tropical, which can be seen in the following result.

Lemma 5 (Butkovič (2010)).

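Let $w$ be a dissimilarity map and let $W$ be its corresponding matrix. Then $w$ is a metric if and only if $W \odot W = W$, where $\odot$ denotes the tropical (min-plus) matrix product, $(W \odot W)_{ij} = \min_{k} (w_{ik} + w_{kj})$.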
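The tropical characterization of metrics can be checked numerically. In min-plus arithmetic, the $(i,j)$ entry of the matrix product of $W$ with itself is $\min_k (w_{ik} + w_{kj})$, and this equals $w_{ij}$ for all $i, j$ exactly when the triangle inequality holds (equivalently, in max-plus arithmetic, the same statement holds for $-W$). The sketch below assumes its input is already a dissimilarity matrix; the function names and both example matrices are hypothetical.

```python
def minplus_matmul(A, B):
    """Min-plus matrix product: entry (i, j) is min over k of A[i][k] + B[k][j]."""
    n = len(A)
    return [[min(A[i][k] + B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def is_metric(W):
    """Check the triangle inequality via tropical idempotency of W.

    Assumes W is a dissimilarity matrix (symmetric, nonnegative, zero diagonal).
    Exact equality is used, so integer or exactly representable entries are
    assumed; a tolerance would be needed for general floating-point input.
    """
    return minplus_matmul(W, W) == W


W_metric = [[0, 2, 3],
            [2, 0, 4],
            [3, 4, 0]]

# Violates the triangle inequality: w(0,2) = 5 > w(0,1) + w(1,2) = 2.
W_not_metric = [[0, 1, 5],
                [1, 0, 1],
                [5, 1, 0]]

print(is_metric(W_metric), is_metric(W_not_metric))
```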

Definition 6.

A metric $d$ is called intrinsic if it gives the length of the shortest path between any two points $x$ and $y$ in the metric space, and this path is contained within the metric space. This path is referred to as the geodesic between $x$ and $y$.

For a tree with $N$ leaves, the pairwise distances between leaves are sufficient to specify a phylogenetic tree, which provides the link between the notions of metrics and trees presented above.

Definition 7.

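Let $T$ be a rooted phylogenetic tree with $N$ leaves, with labels $[N]$, and assign a branch length $b_e$ to each edge $e$ in $T$. Let

$$w(i,j) := \sum_{e \in P_{ij}} b_e,$$

where $P_{ij}$ is the unique path from leaf $i$ to leaf $j$. The map $w$ is then called a tree distance. If $w(i,j) \geq 0$ for all $i, j \in [N]$, then $w$ is a tree metric.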
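The definition of a tree distance translates directly into a path-length computation. The sketch below sums branch lengths along the unique path between each pair of leaves and stores the results as a vector with $\binom{N}{2}$ entries. The four-leaf tree, its internal node labels `"u"` and `"v"`, its branch lengths, and the function names are all hypothetical.

```python
from itertools import combinations


def build_tree(edges):
    """Adjacency map {node: {neighbor: branch_length}} from (u, v, length) triples."""
    adj = {}
    for u, v, length in edges:
        adj.setdefault(u, {})[v] = length
        adj.setdefault(v, {})[u] = length
    return adj


def path_length(adj, src, dst):
    """Sum of branch lengths along the unique src-dst path in a tree."""
    stack = [(src, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == dst:
            return dist
        for nbr, length in adj[node].items():
            if nbr != parent:  # never walk back along the edge just used
                stack.append((nbr, node, dist + length))
    raise ValueError("no path found; the graph is not connected")


def tree_distance_vector(adj, leaves):
    """Pairwise leaf distances in the order (1,2), (1,3), ..., (N-1,N)."""
    return [path_length(adj, i, j) for i, j in combinations(leaves, 2)]


# Hypothetical tree: leaves 1, 2 attach to node "u"; leaves 3, 4 attach to "v".
EDGES = [(1, "u", 1.0), (2, "u", 2.0), ("u", "v", 3.0), (3, "v", 1.0), (4, "v", 2.0)]
tree = build_tree(EDGES)
vector = tree_distance_vector(tree, [1, 2, 3, 4])
print(vector)  # [3.0, 5.0, 6.0, 6.0, 7.0, 3.0]
```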

Note that tree metrics may be represented as cophenetic vectors in the following manner (Cardona et al., 2013):

$$\big(w(1,2),\, w(1,3),\, \ldots,\, w(N-1,N)\big) \in \mathbb{R}^n. \tag{1}$$

Cophenetic vectors are also called tree metrics (in vectorial representation) in the literature on mathematical phylogenetics. The following definition gives a technical condition for tree metrics.

Definition 8.

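The space of tree metrics $\mathcal{T}_N$ for phylogenetic trees with $N$ leaves consists of all $n$-tuples $\big(w(i,j)\big)_{1 \leq i < j \leq N}$ where the maximum among the following quadratic Plücker relations:

$$w(i,j) + w(k,l), \qquad w(i,k) + w(j,l), \qquad w(i,l) + w(j,k) \tag{2}$$

is attained at least twice for $1 \leq i < j < k < l \leq N$. This condition for a tree metric is called the four-point condition (Buneman, 1974). This condition may equivalently be expressed by requiring that the nonnegative, symmetric matrix $W$ have zero entries on the diagonal, and that

$$w(i,j) + w(k,l) \leq \max\big(w(i,k) + w(j,l),\; w(i,l) + w(j,k)\big) \tag{3}$$

for all distinct $i, j, k, l \in [N]$.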

Remark 9.

The space $\mathcal{T}_N$ consists of all trees: by definition, in order to be a tree and an element of $\mathcal{T}_N$, it must satisfy the four-point condition. The four-point condition makes no distinction between rooted or unrooted trees; in the case of rooted trees, the root (i.e. leaf label $0$) is simply ignored and the condition applies to the leaves $[N]$.

The four-point condition is a stronger constraint than the triangle inequality, and is the defining difference between a general metric and a tree metric. In other words, for a distance matrix to be a tree metric, it must satisfy not only the triangle inequality, but also the four-point condition. The implication is that general biological distance matrices do not necessarily give rise to phylogenetic trees: there may be biological processes with evolutionary behavior measured by differences, which may be captured by distance matrices (such as Hamming or Jukes–Cantor distances), however such processes may not necessarily be realized as phylogenetic trees since not all distance matrices satisfy the four-point condition. We now give two examples to illustrate this distinction.

Example 10.

Let and consider the matrix with 0 on diagonal entries, and , and , but . Then

Thus, (3) is not satisfied, and it is therefore impossible to construct a phylogenetic tree with 4 leaves such that, for all $i, j \in [4]$, the length of the unique path connecting leaves $i$ and $j$ is equal to $w(i,j)$ as specified above.

Example 11.

The tree metric for the tree in Figure 1 is . As a matrix , it is

The Plücker relations (2) associated with are

The maximum is achieved exactly twice, and . Also, (3) holds:

Figure 1: Example of an unrooted phylogenetic tree to illustrate the four-point condition.
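The four-point condition can be checked mechanically: for every choice of four indices, the largest of the three pairwise sums must be attained at least twice. The sketch below applies this test to two hypothetical matrices that are not taken from the examples above. The first is the path-length matrix of a four-leaf tree with pendant edges of lengths 1, 2, 1, 2 and one internal edge of length 3; the second is the shortest-path metric on a 4-cycle, which satisfies the triangle inequality but is not a tree metric. The function name and tolerance are assumptions.

```python
from itertools import combinations


def four_point_holds(W, tol=1e-9):
    """True if, for every 4 indices, the max of the three sums is attained twice."""
    for i, j, k, l in combinations(range(len(W)), 4):
        sums = sorted([W[i][j] + W[k][l], W[i][k] + W[j][l], W[i][l] + W[j][k]])
        if sums[2] - sums[1] > tol:  # strict unique maximum: condition fails
            return False
    return True


# Hypothetical tree metric: leaves {0, 1} and {2, 3} are separated by an internal edge.
W_tree = [[0, 3, 5, 6],
          [3, 0, 6, 7],
          [5, 6, 0, 3],
          [6, 7, 3, 0]]

# Shortest-path metric on the 4-cycle 0-1-2-3-0: a metric, but not a tree metric.
W_cycle = [[0, 1, 2, 1],
           [1, 0, 1, 2],
           [2, 1, 0, 1],
           [1, 2, 1, 0]]

print(four_point_holds(W_tree), four_point_holds(W_cycle))
```

For `W_tree` the three sums are 6, 12, and 12, so the maximum is attained twice. For `W_cycle` they are 2, 4, and 2, so the maximum is attained only once and no tree realizes these distances.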

The space $\mathcal{T}_N$ is a subspace of the tropical projective torus $\mathbb{R}^n / \mathbb{R}\mathbf{1}$, where $\mathbf{1}$ denotes the vector of all ones. This quotient space is generated by an equivalence relation $\sim$ where for two points $x, y \in \mathbb{R}^n$, $x \sim y$ if and only if all coordinates of $x - y$ are equal; in other words, tree distances differ from tree metrics by scalar multiples of $\mathbf{1}$. Mathematically, the construction of the tropical projective torus coincides with that of the complex torus, hence its name: all complex tori, up to isomorphism, are constructed by considering a lattice $\Lambda \subset \mathbb{C}^n$ as a real vector space, and then taking the quotient group $\mathbb{C}^n / \Lambda$. Intuitively, this quotient normalizes evolutionary time between trees.
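A short sketch of the quotient by multiples of the all-ones vector: two vectors represent the same point of the tropical projective torus exactly when their difference has all coordinates equal, and one convenient representative is obtained by subtracting the first coordinate. The choice of representative, the function names, and the example vectors are assumptions made for illustration.

```python
def normalize(x):
    """Representative of the class of x with first coordinate equal to zero."""
    return [xi - x[0] for xi in x]


def same_class(x, y, tol=1e-9):
    """True if x - y is a scalar multiple of the all-ones vector."""
    diffs = [a - b for a, b in zip(x, y)]
    return max(diffs) - min(diffs) <= tol


x = [3.0, 5.0, 6.0, 6.0, 7.0, 3.0]
y = [xi + 10.0 for xi in x]  # shift every coordinate by the same constant
z = [3.0, 5.0, 6.0, 6.0, 7.0, 4.0]

print(same_class(x, y), same_class(x, z))
print(normalize(x) == normalize(y))
```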

A strengthening of the triangle inequality specified in Definition 4 gives rise to an important class of tree metrics, which we will now discuss.

Definition 12.

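Let $w$ be a metric. If $w(i,k) \leq \max\big(w(i,j),\, w(j,k)\big)$ for each choice of distinct $i, j, k \in [N]$, then $w$ is an ultrametric.

If $w$ is a tree metric, and the maximum among

$$w(i,j), \qquad w(i,k), \qquad w(j,k)$$

is achieved at least twice for $1 \leq i < j < k \leq N$, then $w$ satisfies the three-point condition, and hence is a tree ultrametric (Jardine et al., 1967). The space of tree ultrametrics $\mathcal{U}_N$ for phylogenetic trees with $N$ leaves consists of all $n$-tuples $\big(w(i,j)\big)_{1 \leq i < j \leq N}$ satisfying the three-point condition for tree ultrametrics.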

As in the case above for metrics to define tree metrics, the strengthening of the triangle inequality for a metric to an ultrametric applies to general metric functions, but analogously, in order for an ultrametric to be a tree ultrametric, it must satisfy the three-point condition. For rooted trees, the three-point condition also implies the four-point condition. Remark 9 concerning the rootedness of the tree also translates to this case: the space $\mathcal{U}_N$ consists of all trees satisfying the three-point condition. In this paper, we study the ultrametric condition in the context of phylogenetic trees. Thus, for convenience, though via somewhat of an abuse of vocabulary, throughout this paper, when we write “ultrametric,” we are referring to a phylogenetic tree with $N$ leaves satisfying the three-point condition for tree ultrametrics given above in Definition 12.

Definition 13.

A rooted phylogenetic tree with $N$ leaves is called equidistant if the distance from every leaf to its root is a constant.

Proposition 14.

A dissimilarity map $w$ computed from all pairwise distances between all pairs of leaves in a phylogenetic tree $T$ is an ultrametric if and only if $T$ is equidistant.

Proof.

For any two points $x, y$ on a tree $T$, we denote by $d(x,y)$ the length of the unique path connecting $x$ and $y$. Suppose a tree $T$ is equidistant with root $r$. Then for any three leaves $i, j, k$, we have that $d(r,i) = d(r,j) = d(r,k)$. Since $d$ satisfies the four-point condition, the maximum among (2):

$$d(r,i) + d(j,k), \qquad d(r,j) + d(i,k), \qquad d(r,k) + d(i,j)$$

is attained at least twice. Thus, the maximum among $d(j,k)$, $d(i,k)$, $d(i,j)$ is also attained at least twice, satisfying the three-point condition, and $w$ is therefore an ultrametric.

Conversely, suppose $w$ is an ultrametric. Then there are finitely many leaves in $T$, so we can choose a pair of leaves $i, j$ such that $d(i,j)$ is maximal among all such pairs. Along the unique path from $i$ to $j$, there is a unique point $r$ such that $d(r,i) = d(r,j)$. For any other leaf $k$, consider the distance $d(r,k)$: Since the paths from $r$ to $i$ and $j$ only intersect at $r$, the path from $r$ to $k$ intersects at least one of them only at $r$. Suppose without loss of generality that the path from $r$ to $i$ is such a path. Then $d(i,k) = d(r,i) + d(r,k)$. Since $d(i,k) \leq d(i,j) = 2\,d(r,i)$, we have $d(r,k) \leq d(r,i)$. If $d(r,k) < d(r,i)$, then $d(i,k) < d(i,j)$, and $d(j,k) \leq d(r,j) + d(r,k) < d(i,j)$, so the maximum among $d(i,j)$, $d(i,k)$, $d(j,k)$ is $d(i,j)$ and it is only attained once, which is a contradiction, since $w$ was assumed an ultrametric. Hence $d(r,k) = d(r,i)$, and $r$ has equal distance to all leaves of $T$. Therefore $T$ is equidistant with root $r$. ∎
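The equivalence between equidistant trees and ultrametrics can be illustrated numerically. In the sketch below, an equidistant tree is described by its clades and their heights above the leaves, so that the distance between two leaves is twice the height of the lowest clade containing both; the three-point condition is then tested on the resulting matrix and, for contrast, on a tree metric that is not equidistant. The clade encoding, heights, function names, and both matrices are hypothetical.

```python
from itertools import combinations


def ultrametric_from_clades(n_leaves, clades):
    """Distance matrix of an equidistant tree given as (leaf_set, height) pairs.

    The list must contain the root clade with all leaves. The distance between
    leaves i and j is twice the height of the lowest clade containing both.
    """
    W = [[0.0] * n_leaves for _ in range(n_leaves)]
    for i, j in combinations(range(n_leaves), 2):
        height = min(h for members, h in clades if i in members and j in members)
        W[i][j] = W[j][i] = 2.0 * height
    return W


def three_point_holds(W, tol=1e-9):
    """True if, for every 3 indices, the largest distance is attained twice."""
    for i, j, k in combinations(range(len(W)), 3):
        a, b, c = sorted([W[i][j], W[i][k], W[j][k]])
        if c - b > tol:
            return False
    return True


# Equidistant tree ((0,1),(2,3)) with clade heights 1 and 2, and root height 3.
clades = [({0, 1}, 1.0), ({2, 3}, 2.0), ({0, 1, 2, 3}, 3.0)]
W_equidistant = ultrametric_from_clades(4, clades)

# A tree metric with unequal pendant edges, hence not equidistant.
W_not_equidistant = [[0, 3, 5, 6],
                     [3, 0, 6, 7],
                     [5, 6, 0, 3],
                     [6, 7, 3, 0]]

print(three_point_holds(W_equidistant), three_point_holds(W_not_equidistant))
```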

2.3 Some Common Tree Metrics

In this section, we provide details on tree metrics that commonly appear in the literature on phylogenetic trees, and those that are particularly relevant in the present work. The selection presented here is far from exhaustive; for a more comprehensive survey, see Weyenberg and Yoshida (2016) and references therein.

The BHV Geodesic Metric.

Billera et al. (2001) introduced a geometric space that considers equidistant rooted phylogenetic trees on a fixed set of leaves as points. This is a moduli space where the points are trees, and the space is endowed with a metric that is defined by a geodesic between two trees. More specifically, the trees are expressed only by the lengths of their internal edges. External, or pendant, edges are not considered, since taking them into account does not affect the geometry of the space: including external edges amounts to taking the product of tree space with an $N$-dimensional Euclidean space. Any two points (i.e. two rooted trees with a fixed set of leaves) are connected by a geodesic, and the distance between two points is defined to be the length of the geodesic connecting them. This space of rooted phylogenetic trees, with zero-length external edges, has since been referred to as the Billera–Holmes–Vogtmann (BHV) tree space, which has been extensively studied in many computational, and especially statistical, settings (Holmes, 2003). Since the goal of the present work is to introduce a new framework for statistical analysis on phylogenetic trees, much of what we will discuss in this paper will relate to BHV space in a comparative manner. In this paper, we distinguish the general BHV space from the case of rooted phylogenetic trees with zero-length external edges and finite branch lengths normalized to length $1$. The latter case is also widely studied (e.g. Section 5 of Holmes (2003), Gavryushkin and Drummond (2016), and Lin et al. (2017)), including in Section 4.4 of the original paper by Billera et al. (2001). The standardization to unit edge lengths presents only a mild simplification of the geometry of BHV space (Holmes, 2003).

We now outline the construction of the BHV space: Consider a rooted tree with $N$ leaves. Such a tree has at most $2N-2$ edges: $N$ terminal edges, which are connected to leaves, and at most $N-2$ internal edges. When a rooted tree is binary (that is, it is a bifurcating tree that has exactly two descendants stemming from each interior node), then the number of edges is exactly $2N-2$; the number of edges is lower than $2N-2$ if it is not binary. To each distinct tree combinatorial type or topology, a Euclidean orthant of dimension $N-2$ (i.e. the number of internal edges) is associated. In this setting, an orthant may be regarded as the polyhedral cone of $\mathbb{R}^{N-2}$ where all coordinates are nonnegative. Thus, for each tree topology, the orthant coordinates correspond to the internal edge lengths in the tree. Since each of the coordinates in an orthant corresponds to an internal edge length, the orthant boundaries (where at least one coordinate is zero) represent trees with collapsed internal edges. These points can be thought of as trees with slightly different, though closely related, tree topologies. The BHV space is constructed by noting that the boundary trees from two different orthants may describe the same polytomic topology, i.e. a split, and thus grafting orthant boundaries together when the trees they represent coincide. The grafting of orthants always occurs at right angles. The BHV space is thus the union of $(2N-3)!!$ polyhedra, each with dimension $N-2$. Each polyhedron may also be thought of as the cone $\mathbb{R}^{N-2}_{\geq 0}$. Note that an equivalent construction may also be outlined for $N-3$ internal edges for the case of unrooted trees.

To compute a geodesic in the general BHV space, first, the geodesic distance between two trees on the BHV tree space is computed, and then the terminal branch lengths are considered to compute the overall geodesic distance between two trees, by taking the differences between terminal branch lengths. Since each orthant is locally viewed as a Euclidean space, the shortest path between two points within a single orthant is a straight Euclidean line. The difficulty appears in establishing which sequence of orthants joining the two topologies contains the geodesic. In the case of four leaves, this can be readily determined using a systematic grid search, but such a search is intractable with larger trees. Owen and Provan (2011) present a quartic-time algorithm (in the number of leaves $N$) for finding the geodesic path between any two points in BHV space. Once the geodesic is known, its length, and thus the distance between the trees, is readily computable. Complete details on the computation of geodesic distances are given by Owen and Provan (2011).
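The two easy cases of BHV distance can be sketched directly. Within a single orthant, the geodesic is a straight line, so the distance is Euclidean in the internal edge lengths. For trees in different orthants, the cone path through the origin has length equal to the sum of the two Euclidean norms, which is always an upper bound on the geodesic distance. The sketch below is not the algorithm of Owen and Provan (2011) and does not compute general geodesics; the dictionary encoding of trees by clade labels, the function names, and the edge lengths are hypothetical.

```python
import math


def same_orthant_distance(x, y):
    """Euclidean distance between two trees sharing the same internal edges.

    Each tree is a dict mapping an internal-edge label to its length.
    """
    if set(x) != set(y):
        raise ValueError("trees lie in different orthants")
    return math.sqrt(sum((x[e] - y[e]) ** 2 for e in x))


def cone_path_length(x, y):
    """Length of the path from x to the origin and on to y (an upper bound)."""
    norm_x = math.sqrt(sum(v * v for v in x.values()))
    norm_y = math.sqrt(sum(v * v for v in y.values()))
    return norm_x + norm_y


# Two trees with topology (((1,2),3),4): N = 4 leaves, N - 2 = 2 internal edges.
tree_x = {"{1,2}": 3.0, "{1,2,3}": 4.0}
tree_y = {"{1,2}": 6.0, "{1,2,3}": 8.0}

print(same_orthant_distance(tree_x, tree_y))  # 5.0
print(cone_path_length(tree_x, tree_y))       # 15.0
```

Here the straight-line distance is much shorter than the cone path, as expected within one orthant; the cone path only becomes the geodesic when no shorter route through intermediate orthants exists.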

Inner Product Distances.

The path difference $d_P$, quartet distance $d_Q$, and Robinson–Foulds distance $d_{RF}$ (or splits distance) are commonly-occurring tree distances, which can be formulated as a form of inner products of vectors. We now briefly describe these distances and illustrate them with the running example of two trees in Figure 2.

Figure 2: Example trees $T_1$ and $T_2$ for inner-product tree distances.

The path difference between two trees $T_1$ and $T_2$ is the Euclidean distance

$$d_P(T_1, T_2) = \big\| v_P(T_1) - v_P(T_2) \big\|_2,$$

where $v_P(T) \in \mathbb{R}^n$ is the vector whose $(i,j)$th entry counts the number of edges between leaves $i$ and $j$ in $T$ (Steel and Penny, 1993). For example, for the trees $T_1$ and $T_2$ in Figure 2, the coordinates of $v_P(T_1)$ and $v_P(T_2)$ are given by

where is the number of edges between leaf and . Thus,

and

A quartet in a tree is a subtree on four leaves induced by removing all other leaves from the tree. For each choice of 4 leaves, there are four possibilities for the tree topology of the induced quartet. Let $Q(T)$ denote the set of quartets induced by a tree $T$. The quartet distance is half the size of the symmetric difference of the trees’ quartets,

$$d_Q(T_1, T_2) = \tfrac{1}{2} \big| Q(T_1) \,\triangle\, Q(T_2) \big|.$$

Referring to Figure 2, since all quartets in are the same except for the quartets whose leaves are and . Since is the square of a Euclidean distance, the distance is used as a matter of convention.

A split in a tree is a bipartition of the leaves induced by removing one edge from the tree (splitting it into two trees defining the bipartition). Let $S(T)$ denote the set of all splits for a tree $T$. The RF (or splits) distance (Robinson and Foulds, 1981) is half the size of the symmetric difference of the trees’ splits,

$$d_{RF}(T_1, T_2) = \tfrac{1}{2} \big| S(T_1) \,\triangle\, S(T_2) \big|.$$

For example, for the trees and in Figure 2, we have since all splits in are the same, except for the splits obtained by removing the middle edge between and . Since is the square of a Euclidean distance, the distance is also used by convention.
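The path difference and the Robinson–Foulds distance can both be computed from an adjacency-list description of each tree. The sketch below uses two hypothetical unrooted four-leaf trees, one grouping leaves $\{1,2\}$ against $\{3,4\}$ and the other grouping $\{1,3\}$ against $\{2,4\}$; these are not the trees of Figure 2. Function names and internal node labels are assumptions.

```python
import math
from collections import deque
from itertools import combinations


def hops_from(adj, src):
    """Number of edges from src to every node, by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def path_vector(adj, leaves):
    """Edge counts between all leaf pairs, in the order (1,2), (1,3), ..."""
    return [hops_from(adj, i)[j] for i, j in combinations(leaves, 2)]


def path_difference(adj1, adj2, leaves):
    """Euclidean distance between the two path vectors."""
    v1, v2 = path_vector(adj1, leaves), path_vector(adj2, leaves)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))


def splits(adj, leaves):
    """Set of leaf bipartitions obtained by deleting each edge in turn."""
    leaves = frozenset(leaves)
    result = set()
    for u in adj:
        for v in adj[u]:
            # Leaves reachable from u without passing through v.
            seen, stack, side = {u, v}, [u], set()
            while stack:
                node = stack.pop()
                if node in leaves:
                    side.add(node)
                for nbr in adj[node]:
                    if nbr not in seen:
                        seen.add(nbr)
                        stack.append(nbr)
            result.add(frozenset([frozenset(side), leaves - side]))
    return result


def rf_distance(adj1, adj2, leaves):
    """Half the size of the symmetric difference of the two split sets."""
    return len(splits(adj1, leaves) ^ splits(adj2, leaves)) / 2


LEAVES = [1, 2, 3, 4]
T_A = {1: ["a"], 2: ["a"], 3: ["b"], 4: ["b"], "a": [1, 2, "b"], "b": [3, 4, "a"]}
T_B = {1: ["a"], 3: ["a"], 2: ["b"], 4: ["b"], "a": [1, 3, "b"], "b": [2, 4, "a"]}

print(path_vector(T_A, LEAVES), path_vector(T_B, LEAVES))
print(path_difference(T_A, T_B, LEAVES))  # 2.0
print(rf_distance(T_A, T_B, LEAVES))      # 1.0
```

The two trees share all four trivial splits and differ only in their single internal split, which gives a symmetric difference of size two and a Robinson–Foulds distance of one.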

Cophenetic Metrics & Interleaving Distances.

The cophenetic vectorization of tree metrics (1) represents trees as points in the Euclidean space $\mathbb{R}^n$, which may be compared using the cophenetic metric, which is simply an $\ell^p$ distance between the vectors (Cardona et al., 2013). In topological data analysis, the interleaving distance is constructed in the context of category theory and used to compare persistence modules. Recent work by Munch and Stefanou (2018) shows that the $\ell^\infty$-cophenetic metric is in fact an interleaving distance, connecting the fields of mathematical phylogenetics and topological data analysis.

3 Palm Tree Space

In this section, we define and study the tropical moduli space of phylogenetic trees, that is, the space of phylogenetic trees under tropical arithmetic and the tropical metric. We refer to the embedded space of phylogenetic trees equipped with the tropical metric as the palm tree space (i.e. tropical tree space). We begin by motivating the need for such a space, give the definition of the tropical metric, and prove its stability with respect to the BHV metric. We then characterize the geometry and topology of palm tree space, and prove the existence of properties desirable for statistical analysis.

Motivation.

Locally, the BHV space of phylogenetic trees as described above behaves like Euclidean space, and the BHV metric is a geodesic metric with the CAT(0) property. The CAT(0) property is a result on the curvature of the space, and is derived from the right-angle grafting of orthants. Various algorithms to approximate and compute this geodesic have been developed (Amenta et al., 2007; Kupczok et al., 2008). However, it has been recently shown by Lin et al. (2017) that under the BHV metric, geodesic triangles computed by state-of-the-art algorithms (Owen and Provan, 2011) can have arbitrarily high dimension.

Moreover, the depth of geodesics (that is, the maximal depth of points along the geodesic, where the depth of a point in the relative interior of some -dimensional polyhedron is ) is also shown to be large in BHV space: geodesics traversing orthants in BHV space are cone paths, which pass through the origin (i.e. the tree without internal edges), and have maximal depth. Large depths are problematic since they give rise to the phenomenon of stickiness: a metric space is sticky if the mean lies stably at a singularity. More generally, let $\mathcal{M}$ be a set of measures on some metric space $K$, with $\mathcal{M}$ endowed with some topology, and let a mean be a continuous map from $\mathcal{M}$ to closed subsets of $K$. A measure $\mu \in \mathcal{M}$ is said to stick to a closed subset $C \subseteq K$ if every neighborhood of $\mu$ in $\mathcal{M}$ contains a nonempty open subset that consists of measures whose mean sets are contained in $C$. Stickiness occurs in general stratified spaces, and tree spaces in particular (Hotz et al., 2013; Huckemann et al., 2015); see Example 15 for an illustration. The implication of stickiness is that sufficiently small perturbations of the data points result in no change in the mean. The probabilistic and statistical consequence is the inability to determine an exact asymptotic distribution of the mean, which is essentially prohibitive for basic statistical inference, such as hypothesis testing.

Example 15.

In Figure 3, we position three unit masses on the 3-spider (that is, the rooted phylogenetic tree with three leaves and fixed pendant edge lengths) and calculate the position of the barycenter (Fréchet mean) by minimizing . The solution is for , and for . The Fréchet mean tends to stick to lower-dimensional strata.

Figure 3: 3-spider to illustrate stickiness.
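The sticking behavior can be reproduced numerically. The sketch below is only an illustration under assumed inputs: it places two unit masses at radius 1 on two legs of the 3-spider and a third at radius `x` on the remaining leg. This configuration is hypothetical and is not necessarily the one drawn in Figure 3. The helper names `spider_distance`, `frechet_mean`, and `mean_for_third_mass` are likewise made up for this example, and the Fréchet mean is found by a plain grid search rather than by any method from the paper.

```python
import numpy as np

def spider_distance(p, q):
    """Distance on the 3-spider between points given as (leg, radius) pairs."""
    (leg_p, r_p), (leg_q, r_q) = p, q
    # Same leg: move along the leg. Different legs: pass through the origin.
    return abs(r_p - r_q) if leg_p == leg_q else r_p + r_q

def frechet_mean(points, grid=np.linspace(0.0, 3.0, 3001)):
    """Grid search for a minimizer of the sum of squared distances."""
    best = None
    for leg in range(3):
        for r in grid:
            value = sum(spider_distance((leg, float(r)), p) ** 2 for p in points)
            # Strict improvement only, so ties at the origin keep the first hit.
            if best is None or value < best[0] - 1e-12:
                best = (value, leg, float(r))
    return best[1], best[2]

def mean_for_third_mass(x):
    """Assumed configuration: masses at radius 1 on legs 0 and 1, radius x on leg 2."""
    points = [(0, 1.0), (1, 1.0), (2, float(x))]
    return frechet_mean(points)

for x in (0.5, 1.0, 2.0, 3.5):
    leg, radius = mean_for_third_mass(x)
    print(f"x = {x}: mean on leg {leg} at radius {radius:.3f}")
```

For this assumed configuration, differentiating the objective along the third leg gives a stationary point at radius (x - 2)/3, so the mean stays at the origin for every x up to 2 and only then moves onto the third leg. Moving the third mass anywhere within that range leaves the mean unchanged, which is the sticky behavior described above.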

The tropical metric, however, bypasses the computational shortcomings of high-dimensional geodesic triangles as well as large geodesic depths, and thus exhibits desirable properties for geometric statistics. In particular, the dimension of tropical geodesic triangles is at most 2, and the depths of random geodesic paths are low (Lin et al., 2017). These properties motivate its use and further study in the present work.

3.1 The Tropical Metric

Working on the tropical projective torus $\mathbb{R}^n/\mathbb{R}\mathbf{1}$, we define a generalized Hilbert projective metric function (Cohen et al., 2004; Akian et al., 2011) on this quotient space and refer to it as the tropical metric.

Definition 16.

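For any point $x \in \mathbb{R}^n$, denote its coordinates by $(x_1, \ldots, x_n)$ and its representation in $\mathbb{R}^n/\mathbb{R}\mathbf{1}$ by $\bar{x}$. For $\bar{x}, \bar{y} \in \mathbb{R}^n/\mathbb{R}\mathbf{1}$, set the distance between $\bar{x}$ and $\bar{y}$ to be

$$d_{\mathrm{tr}}(\bar{x}, \bar{y}) = \max_{1 \le i < j \le n} \big| (x_i - y_i) - (x_j - y_j) \big| = \max_{1 \le i \le n} (x_i - y_i) - \min_{1 \le i \le n} (x_i - y_i).$$

We refer to the function $d_{\mathrm{tr}}$ as the tropical metric.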
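As a minimal sketch of this definition (the function name `tropical_distance` and the sample vectors are assumptions made for illustration), the tropical metric can be computed as the range of the coordinate-wise differences. The last lines check that the value does not change when a constant is added to all coordinates of either point, which is what makes the function well defined on the quotient space.

```python
import numpy as np

def tropical_distance(x, y):
    """Tropical metric: max_i (x_i - y_i) - min_i (x_i - y_i)."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff.max() - diff.min())

# Illustrative vectors (not taken from the paper).
x = np.array([0.0, 3.0, 1.0])
y = np.array([2.0, 1.0, 1.0])

d = tropical_distance(x, y)

# Shifting either representative by a multiple of the all-ones vector
# leaves the distance unchanged.
d_shifted = tropical_distance(x + 5.0, y - 2.0)
print(d, d_shifted)
```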

Proposition 17.

The function $d_{\mathrm{tr}}$ is a well-defined metric function on $\mathbb{R}^n/\mathbb{R}\mathbf{1}$.

Proof.

We verify that the defining properties of metrics are satisfied.

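  • Symmetry: By definition, for $\bar{x}, \bar{y} \in \mathbb{R}^n/\mathbb{R}\mathbf{1}$, $d_{\mathrm{tr}}(\bar{x}, \bar{y}) = d_{\mathrm{tr}}(\bar{y}, \bar{x})$.

  • Nonnegativity: Since $|(x_i - y_i) - (x_j - y_j)| \ge 0$ for all $i, j$, so is $d_{\mathrm{tr}}(\bar{x}, \bar{y})$.

  • Identity of Indiscernibles: If $d_{\mathrm{tr}}(\bar{x}, \bar{y}) = 0$, then the differences $x_i - y_i$ are equal for all $i$, thus $\bar{x} = \bar{y}$.

  • Triangle Inequality: For $\bar{x}, \bar{y}, \bar{z} \in \mathbb{R}^n/\mathbb{R}\mathbf{1}$, we now show that $d_{\mathrm{tr}}(\bar{x}, \bar{z}) \le d_{\mathrm{tr}}(\bar{x}, \bar{y}) + d_{\mathrm{tr}}(\bar{y}, \bar{z})$. Suppose $1 \le i < j \le n$ are such that $d_{\mathrm{tr}}(\bar{x}, \bar{z}) = |(x_i - z_i) - (x_j - z_j)|$. Note that

    $$(x_i - z_i) - (x_j - z_j) = \big[ (x_i - y_i) - (x_j - y_j) \big] + \big[ (y_i - z_i) - (y_j - z_j) \big].$$

    Hence

    $$d_{\mathrm{tr}}(\bar{x}, \bar{z}) \le \big| (x_i - y_i) - (x_j - y_j) \big| + \big| (y_i - z_i) - (y_j - z_j) \big| \le d_{\mathrm{tr}}(\bar{x}, \bar{y}) + d_{\mathrm{tr}}(\bar{y}, \bar{z}).$$

    The function $d_{\mathrm{tr}}$ satisfies the triangle inequality on $\mathbb{R}^n/\mathbb{R}\mathbf{1}$.

Thus, $d_{\mathrm{tr}}$ is a metric function on $\mathbb{R}^n/\mathbb{R}\mathbf{1}$.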

With this metric, we formally define palm tree space as follows.

Definition 18.

For a positive integer $N$, let $\mathcal{T}_N$ be the space of phylogenetic trees as in Definition 8. The metric space $(\mathcal{T}_N, d_{\mathrm{tr}})$ is called the palm tree space.

Stability of the Tropical Metric.

Within $\mathbb{R}^n/\mathbb{R}\mathbf{1}$, the following lemma ensures coordinate-wise stability of the tropical metric.

Lemma 19.

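Let $x \in \mathbb{R}^n$. For $1 \le i \le n$, if we perturb the $i$th coordinate of $x$ by $\varepsilon$ to obtain another point $x'$, then in $\mathbb{R}^n/\mathbb{R}\mathbf{1}$ we have

$$d_{\mathrm{tr}}(\bar{x}, \bar{x}') = |\varepsilon|.$$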

Proof.

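For $1 \le j \le n$, the difference $x'_j - x_j = 0$ if $j \ne i$, and $x'_i - x_i = \varepsilon$. The set of these differences is then $\{0, \varepsilon\}$ (or $\{0\}$ when $\varepsilon = 0$). By Definition 16, $d_{\mathrm{tr}}(\bar{x}, \bar{x}') = |\varepsilon|$. ∎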
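The lemma is easy to check numerically. The following sketch assumes a helper `tropical_distance` that computes the range of coordinate-wise differences, and uses an arbitrary random point and perturbation size chosen only for illustration.

```python
import numpy as np

def tropical_distance(x, y):
    """Tropical metric: range of the coordinate-wise differences."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff.max() - diff.min())

rng = np.random.default_rng(0)
x = rng.normal(size=6)      # arbitrary point with n = 6 coordinates
eps = -0.75                 # perturbation applied to a single coordinate

x_perturbed = x.copy()
x_perturbed[2] += eps

# The tropical distance should equal |eps| up to floating-point rounding.
shift_distance = tropical_distance(x, x_perturbed)
print(shift_distance, abs(eps))
```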

Palm tree space and BHV space are not isometric; the tropical metric is nevertheless stable. In other words, perturbations of points in BHV space, measured by the BHV metric $d_{\mathrm{BHV}}$, correspond to bounded perturbations of their images in palm tree space, measured by the tropical metric. This stability property is desirable, since it allows for interpretable comparisons between the two spaces, and allows results in the typical BHV framework to be “translated” over to palm tree space.

Theorem 20 (Stability).

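Let $N$ be the number of leaves in palm tree space and BHV space. Let $T_1$ and $T_2$ be two unrooted phylogenetic trees with $N$ leaves. Then the following inequality holds:

$$d_{\mathrm{tr}}(T_1, T_2) \le \sqrt{N+1}\; d_{\mathrm{BHV}}(T_1, T_2).$$

Moreover, the smallest possible constant is $\sqrt{N+1}$.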

Proof.

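We first prove that for any two unrooted trees $T_1, T_2$ with $N$ leaves, $d_{\mathrm{tr}}(T_1, T_2) \le \sqrt{N+1}\; d_{\mathrm{BHV}}(T_1, T_2)$. First, assume that $T_1, T_2$ belong to the same orthant in BHV space. Then no matter what the tree topology is, if we denote the differences of the lengths of the internal edges in $T_1$ and $T_2$ by $b_1, \ldots, b_{N-3}$, and the differences of the lengths of the external edges by $p_1, \ldots, p_N$, we always have

$$d_{\mathrm{BHV}}(T_1, T_2) = \sqrt{\sum_{i=1}^{N-3} b_i^2 + \sum_{j=1}^{N} p_j^2}$$

(e.g. Owen and Provan (2011); Lin et al. (2017)).

For every pair of leaves in both trees, the distance between them is a sum of the lengths of some internal edges and two external edges. In other words, all differences are of the form of the sum between some $b_i$, and $p_j + p_k$ with $j \ne k$. Thus, the maximum of these differences is at most the sum of all positive $b_i$ values, plus the two greatest $p_j$ values (take these to be $p_1$ and $p_2$), while the minimum of these differences is at least the sum of all negative $b_i$ values, plus the two smallest $p_j$ values (take these to be $p_3$ and $p_4$). By definition, $d_{\mathrm{tr}}(T_1, T_2)$ is the maximum minus the minimum of these differences, so we have

$$d_{\mathrm{tr}}(T_1, T_2) \le \sum_{i=1}^{N-3} |b_i| + |p_1| + |p_2| + |p_3| + |p_4|.$$

By the Cauchy–Schwarz inequality (Cauchy, 1821; Schwarz, 1890), applied to the $(N-3) + 4 = N+1$ terms on the right-hand side,

$$\sum_{i=1}^{N-3} |b_i| + |p_1| + |p_2| + |p_3| + |p_4| \le \sqrt{N+1}\, \sqrt{\sum_{i=1}^{N-3} b_i^2 + p_1^2 + p_2^2 + p_3^2 + p_4^2}.$$

Hence

$$d_{\mathrm{tr}}(T_1, T_2) \le \sqrt{N+1}\, \sqrt{\sum_{i=1}^{N-3} b_i^2 + \sum_{j=1}^{N} p_j^2} = \sqrt{N+1}\; d_{\mathrm{BHV}}(T_1, T_2).$$

Now, for $T_1, T_2$ with distinct tree topologies, we consider the unique geodesic connecting them: there exist finitely many points $t_0, t_1, \ldots, t_m$ in BHV space such that $t_{i-1}$ and $t_i$ belong to the same polyhedron corresponding to a tree topology for $1 \le i \le m$, where $t_0 = T_1$ and $t_m = T_2$, and $d_{\mathrm{BHV}}(T_1, T_2) = \sum_{i=1}^{m} d_{\mathrm{BHV}}(t_{i-1}, t_i)$. For $1 \le i \le m$, by the proof above, we have that

$$d_{\mathrm{tr}}(t_{i-1}, t_i) \le \sqrt{N+1}\; d_{\mathrm{BHV}}(t_{i-1}, t_i).$$

Thus, by the triangle inequality for $d_{\mathrm{tr}}$,

$$d_{\mathrm{tr}}(T_1, T_2) \le \sum_{i=1}^{m} d_{\mathrm{tr}}(t_{i-1}, t_i) \le \sqrt{N+1} \sum_{i=1}^{m} d_{\mathrm{BHV}}(t_{i-1}, t_i) = \sqrt{N+1}\; d_{\mathrm{BHV}}(T_1, T_2).$$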

Next, we consider the case where the equality holds: consider two trees and with leaves and the same tree topology, given by the following nested sets

Suppose in , the internal edges have lengths

Similarly, in , the internal edges have lengths

The external edge lengths of and are

Then

For , in either tree the distance is the sum of the edge lengths of

Since for and for , the maximum of all differences is

and the minimum of all differences is

By definition, $d_{\mathrm{tr}}(T_1, T_2) = \sqrt{N+1}\; d_{\mathrm{BHV}}(T_1, T_2)$ in this case. Thus, $\sqrt{N+1}$ is the smallest possible stability constant. ∎
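The same-orthant part of the argument can be sanity-checked numerically. The sketch below rests on assumptions that are not taken from the paper's own computations: it uses a caterpillar topology whose internal edges are the nested splits {1,2}, {1,2,3}, and so on; it treats the same-orthant BHV distance as the Euclidean distance between the full edge-length vectors, pendant edges included; and it reads the stability constant as the square root of N + 1, the number of terms entering the Cauchy–Schwarz step. The helper names `caterpillar_tree_metric` and `tropical_distance` are made up for this example.

```python
import itertools
import math
import numpy as np

def tropical_distance(x, y):
    """Tropical metric: range of the coordinate-wise differences."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff.max() - diff.min())

def caterpillar_tree_metric(internal, pendant):
    """Pairwise leaf distances of a caterpillar tree.

    Internal edge k (0-based) separates leaves {0, ..., k+1} from the rest.
    """
    n = len(pendant)
    assert len(internal) == n - 3
    dists = []
    for i, j in itertools.combinations(range(n), 2):
        # The path from leaf i to leaf j crosses edge k iff i <= k + 1 < j.
        crossed = sum(internal[k] for k in range(n - 3) if i <= k + 1 < j)
        dists.append(pendant[i] + pendant[j] + crossed)
    return np.array(dists)

rng = np.random.default_rng(1)
n_leaves = 7
bound = math.sqrt(n_leaves + 1)
worst_ratio = 0.0
for _ in range(2000):
    # Positive internal lengths keep both trees in the same orthant.
    int1, int2 = rng.uniform(0.1, 2.0, size=(2, n_leaves - 3))
    pen1, pen2 = rng.uniform(0.1, 2.0, size=(2, n_leaves))
    d_bhv = math.sqrt(np.sum((int1 - int2) ** 2) + np.sum((pen1 - pen2) ** 2))
    d_tr = tropical_distance(caterpillar_tree_metric(int1, pen1),
                             caterpillar_tree_metric(int2, pen2))
    worst_ratio = max(worst_ratio, d_tr / d_bhv)

print(worst_ratio, bound)
```

Random sampling of this kind only probes the inequality; it does not locate the extremal pair of trees constructed in the proof, so the observed ratio will typically stay well below the bound.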

There are several important remarks concerning this stability result to discuss:

Remark 21.

The stability constant is the best possible; however, it is not universal, since it depends on the number of leaves in a tree. For a fixed number of leaves, the value is indeed constant.

Remark 22.

It is important to note here that explicit calculations involving geodesics between trees in the original paper by Billera et al. (2001) do not include pendant (external) edges, since these do not modify the geometry of the space. Indeed, their inclusion only amounts to an additional Euclidean factor, since the tree space then becomes the Cartesian product of the BHV space of trees with internal edges only, and $\mathbb{R}_{\ge 0}^{N}$, with one nonnegative coordinate per pendant edge. Geodesic distances, which depend directly on geodesic paths (the former is the length of the latter), considered in Billera et al. (2001) also do not include pendant edges.

The statement of Theorem 20 treats the most general case of unrooted trees with pendant edges included in both tree spaces. Our reasoning for considering this case, which differs from what appears in Billera et al. (2001) as described above, is twofold. First, the procedure for calculating geodesic distances between trees in BHV space described in Section 2.3 follows Owen and Provan (2011), who explicitly consider pendant edges in computing geodesics, and whose algorithm is considered to be the current standard. Since the tropical metric includes pendant edges in its calculation, a study relative to the Owen–Provan procedure, rather than the procedure given in Billera et al. (2001) where pendant edges are excluded, provides a more valid comparison. Second, the present paper builds upon previous work on the tropical metric in Lin et al. (2017), which also deals with comparative studies relative to the Owen–Provan algorithm. We follow in this same vein for consistency.

In the interest of statistical interpretation, Theorem 20 provides an important comparative measure and guarantees that quantitative results from BHV space are bounded in palm tree space. For example, in single-linkage clustering, where clusters are fully determined by distance thresholds, the stability result means that a given clustering pattern in BHV space will be preserved in palm tree space, thus maintaining interpretability of clustering behavior.

3.2 Geometry & Topology of Palm Tree Space

A property of BHV space used above in the proof of Theorem 20 leads naturally to the study of geometric and topological properties that characterize palm tree space, and additional important differences between the two spaces. These characteristics will now be developed in this subsection.

Geometry, Geodesics & Computational Complexity in Palm Tree Space.

Recall from Definition 6 above that a geodesic between two points in a metric space is a path joining them whose length equals the distance between them, i.e. a shortest path. We now characterize and discuss such paths in palm tree space.

Definition 23.

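Given $x, y \in \mathbb{R}^n/\mathbb{R}\mathbf{1}$, the tropical line segment with endpoints $x$ and $y$ is the set

$$\{\, a \odot x \oplus b \odot y \in \mathbb{R}^n/\mathbb{R}\mathbf{1} \;\mid\; a, b \in \mathbb{R} \,\}.$$

Here, max-plus addition $\oplus$ for two vectors is performed coordinate-wise, and $a \odot x$ adds the scalar $a$ to every coordinate of $x$.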

Proposition 24.

For points $x, y \in \mathbb{R}^n/\mathbb{R}\mathbf{1}$, the tropical line segment connecting $x$ and $y$ is a geodesic.

Proof.

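It suffices to show that for any $z \in \Gamma_{x,y}$, we have that

$$d_{\mathrm{tr}}(x, y) = d_{\mathrm{tr}}(x, z) + d_{\mathrm{tr}}(z, y),$$

where $\Gamma_{x,y}$ is the tropical line segment connecting $x$ and $y$. We may assume, after permuting coordinates, that $y_i - x_i \le y_{i+1} - x_{i+1}$ for $1 \le i \le n-1$. Under this assumption, $d_{\mathrm{tr}}(x, y) = (y_n - x_n) - (y_1 - x_1)$. Write $z = a \odot x \oplus b \odot y$, so that $z_i = \max(a + x_i,\, b + y_i)$. Now if $k$ is the largest index such that $y_k - x_k \le a - b$, then $z_i = a + x_i$ for $i \le k$, and, analogously, $z_i = b + y_i$ for $i > k$. If $k = 0$ (no such index exists) or $k = n$, then $z$ is equal to either $y$ or $x$ in $\mathbb{R}^n/\mathbb{R}\mathbf{1}$ and the claim is apparent. We may thus assume $1 \le k \le n-1$.

The set of all differences $z_i - x_i$ contains $a$ and the greater values $b + y_i - x_i$ for $i > k$. So,

$$d_{\mathrm{tr}}(x, z) = (b + y_n - x_n) - a.$$

Similarly, the set of all differences $y_i - z_i$ contains $-b$ and the smaller values $y_i - x_i - a$ for $i \le k$. So,

$$d_{\mathrm{tr}}(z, y) = -b - (y_1 - x_1 - a).$$

Therefore, $d_{\mathrm{tr}}(x, z) + d_{\mathrm{tr}}(z, y) = (y_n - x_n) - (y_1 - x_1) = d_{\mathrm{tr}}(x, y)$ and the tropical line segment connecting $x$ and $y$ is a geodesic. ∎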
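The additivity used in this proof can be observed numerically. The sketch below assumes the max-plus description of the segment, in which a point is the coordinate-wise maximum of shifted copies of the two endpoints, and checks that distances add up along it. The names `tropical_distance` and `tropical_segment_point` and the random endpoints are chosen only for illustration.

```python
import numpy as np

def tropical_distance(x, y):
    """Tropical metric: range of the coordinate-wise differences."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff.max() - diff.min())

def tropical_segment_point(x, y, a, b):
    """Max-plus combination of x and y: coordinate-wise max(a + x_i, b + y_i)."""
    return np.maximum(a + x, b + y)

rng = np.random.default_rng(2)
x = rng.normal(size=5)
y = rng.normal(size=5)
total = tropical_distance(x, y)

# Only the difference a - b matters in the quotient space, so fix a = 0
# and sweep b across a range wide enough to cover the whole segment.
max_gap = 0.0
for shift in np.linspace(-4.0, 4.0, 41):
    z = tropical_segment_point(x, y, 0.0, shift)
    gap = abs(tropical_distance(x, z) + tropical_distance(z, y) - total)
    max_gap = max(max_gap, gap)

print(max_gap)  # should be zero up to floating-point rounding
```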

Geodesics in palm tree space are in general not unique. This is a common occurrence in various metric spaces.

Example 25.

Consider the following union of three -dimensional orthants

where the distance function is the usual Euclidean metric within each orthant. For two points where and , there exist two shortest paths: one that connects to by passing through , and the other passing through The length of both paths is