Geometric comparison of phylogenetic trees with different leaf sets

The metric space of phylogenetic trees defined by Billera, Holmes, and Vogtmann, which we refer to as BHV space, provides a natural geometric setting for describing collections of trees on the same set of taxa. However, it is sometimes necessary to analyze collections of trees on non-identical taxa sets (i.e., with different numbers of leaves), and in this context it is not evident how to apply BHV space. Davidson et al. recently approached this problem by describing a combinatorial algorithm extending tree topologies to regions in higher dimensional tree spaces, so that one can quickly compute which topologies contain a given tree as partial data. In this paper, we refine and adapt their algorithm to work for metric trees to give a full characterization of the subspace of extensions of a subtree. We describe how to apply our algorithm to define and search a space of possible supertrees and, for a collection of tree fragments with different leaf sets, to measure their compatibility.

1. Introduction

In the context of evolutionary biology, given a set of organisms referred to as taxa, a phylogenetic tree is a semi-labeled, weighted acyclic graph representing a possible evolutionary relationship between the taxa, using genotypic or phenotypic data. Such trees typically have a root which represents the common ancestor of the taxa, with a branch point at each speciation event, and a leaf for each taxon, such that the taxa which share more features are “nearer” to each other in the tree. Here the intrinsic tree distance is exhibited by shortest path length in the weighted tree: a series of edges without repetition gives a unique path from one leaf to another, and the sum of their lengths is distance, indicative of the genetic or phenotypic changes and differences between the taxa.

In topological data analysis, phylogenetic trees represent an important class of metric spaces, finite additive spaces, which exhibit no persistent topological features at any scale in degree larger than 0. In this way, techniques such as those we present can be viewed as complementary to homological methods.

Figure 1. Phylogenetic Tree of Life [26]

In addition to the intrinsic distances between the taxa that a single phylogenetic tree represents, we can also define an extrinsic distance between distinct phylogenetic trees with the same set of taxa. In 2001, Billera, Holmes, and Vogtmann defined a configuration space of possible phylogenetic trees relating a set of taxa [5]. These trees can be continuously parametrized by the topology and edge lengths, and the result is a contractible geodesic space with non-positive curvature, referred to here as tree space, or , for taxa labels . The extrinsic distance between two trees is realized as the length of the unique geodesic between the two points in which represent the trees.

To combine the data of more than two trees, e.g. if is a set of phylogenetic trees describing different evolutionary relationships between the taxa (leaf set) , is represented as a set of points in . By taking the mean of [17, 3, 4, 7], or clustering the points [10], or constructing confidence regions [27], we can describe in a way which incorporates the range of metric and combinatorial shape differences.

However, there are situations in which one of the assumptions of this model, that each tree in has a fixed leaf set , is not reasonable. For example, with improvements in sequencing technology, many phylogenetic datasets now consist of thousands of gene trees, each of which represents the evolutionary history of a single gene in the species set of interest [15]. However, not all genes appear in all species, and currently genes with an incomplete leaf-set are often discarded before beginning the analysis. A second example is comparing parallel evolutionary chains in viruses or tumors, where some strains are comparably similar across samples (and therefore can be considered the same leaf) but are not necessarily all present in every sample [30], i.e. each has its own leaf set which is contained in some common larger set . The fact that the trees belong to different parametrized spaces prevents us from using the techniques of BHV analysis described previously, but as we will show, tree sets with some “combinatorial compatibility” will admit a fairly precise notion of distance which is based on the BHV metric in , with no loss of data.

Our approach to this problem uses the tree dimensionality reduction map defined in Zairis et al. [30], which gives a map from a tree space to the lower-dimensional tree space that contains all trees with a subset of the leaves . This map is induced by the natural subspace projection. We will first construct the pre-image of this map, which can be used to recover information about the original tree from the images for varying . This map is also fundamental to the previous applications, which we solve by mapping to their preimages in the common domain space , and comparing the sets.

This precise problem, of analyzing trees with different numbers of taxa collectively in BHV tree space, was first approached by Bi et al. [2]. They developed the theory behind the combinatorial step in Section 3.3.1, toward the goal of comparing trees with different taxa sets. The algorithm presented in that section, together with Proposition 3.6, clarifies their results and shows their implications for the computation of tree dimensionality reduction and its preimage.

Analysis in BHV space is, of course, not the only way to approach problems of this type. Given the set , it is sometimes efficient to “prune” the trees to their common taxa for comparison, if such a set is sufficiently large to preserve important data. In this case, any tool for analyzing sets of trees with identical taxa can then be used. In the context of reconstructing a species tree from gene trees, the relationship between these trees is modeled by the coalescent process, and algorithms and approaches specific to this situation can take advantage of this model [21, 18]. To avoid making simplifying assumptions, there are also some software packages currently available which use Bayesian coalescent-based techniques, from the original data rather than trees, to assemble multiple parallel, incomplete data samples into a single tree [9, 11, 14]. There are also algorithms, based on the (often reasonable) assumption that differences in topology arise from recombination events, that aggregate metric data into phylogenetic networks [22]. These can often accommodate non-uniform data as well. However, they share the same drawback as most classical phylogenetic tree algorithms, in that they produce a single tree or tree-like object, rather than a region of possible trees in tree space.

There is also the problem of supertree reconstruction, which aims to combine partially overlapping phylogenies into a common tree. Summaries and selected supertree methods can be found in Bininda-Emonds [6], Akanni et al. [1], Warnow [24], and Wilkinson et al. [28]. The techniques in this paper give a conservative (low tolerance for topological error), split-based supertree method for BHV space, which does not necessarily represent an improvement on the search for a maximum-likelihood supertree; rather, we can rigorously (rather than heuristically) define the space of possible supertrees, in a manner amenable to search, and expand the possible analyses available.

With the geometric framework established in this paper, we can define and compute some useful objects. First, in Section 3, we show how to efficiently compute , the preimage of tree under the tree reduction map, which gives all trees with the full set of leaves that map onto . The algorithm, given in two parts, calculates the extension space , which represents the set of all phylogenetic trees which can result from adding additional leaves to . Theorem 3.12 shows that this construction, which extends the results and definitions of [2], coincides with in .

This fact immediately gives a method of finding the set of trees which satisfy the system for some collection of trees , and we suggest some shortcuts to speed up the process. This solution space is computed efficiently in Section 4 in a method similar to the one presented in Section 3, and is shown in Proposition 4.7 to be the intersection of sets in a common domain.

Stability concerns lead us to Section 5, which first defines an approximate solution space to with some parameter of constant error tolerance, or of error tolerance proportional to local size. These will be the products of Sections 5.1 and 5.2, and will allow for stability results (5.5) and (5.6). The proposition (5.5) implies an additional non-trivial fact about a set , that if it intersects a cubical face , it intersects all cubes .

From these definitions we can define two parameters and measuring the degree of metric distortion for a collection of trees satisfying a combinatorial compatibility condition. The parameters represent the minimum error tolerance (uniform or proportional) necessary to construct a supertree from the collection. These parameters will result from linear optimization problems related to the equations defining the approximate solution spaces, and can be directly computed using the most efficient linear programming methods available.

Some directions for future work are sketched in Section 6. The suggested projects include giving a full extension of the definitions, computations, and parameters to tree sets which need not be combinatorially compatible, and relaxations which can exceed the boundaries of their supporting orthants. Additionally, a probabilistic framework for random trees on different leaf sets would be able to give significance to the thresholding tests, which in this paper are merely heuristic.

2. Background

2.1. Phylogenetic trees

Definition 2.1.

A phylogenetic tree is an acyclic connected graph (a tree) with

  • No degree 2 vertices.

  • Degree 1 vertices each have a unique label. Such vertices are called leaves of . The set of leaf labels is denoted .

  • There is a positive weight for each edge , and the set of edges is denoted .

Unless indicated otherwise, for the number of leaves. Phylogenetic trees are sometimes rooted, meaning the tree has a distinguished leaf, the root, often an ancestor. For this paper, we will use unrooted trees, but all results carry over to rooted trees by fixing one of the leaves as the root, and assuming this leaf is in all trees considered. The topology of a tree is the unweighted underlying tree with leaf labels.

Because phylogenetic trees are acyclic, the removal of an edge separates into two connected components. Since leaves are vertices in one component or another, this gives a partition of into the two components and , called a split and represented as . When the ground set is obvious, we will suppress the complement and give a split by the smaller of its two partition sets, or if the two partitions are the same size, with the partition containing the lexicographically first leaf. A split is called thick if and both have cardinality greater than 1, or equivalently if neither endpoint of is a leaf. Such an edge or split is called internal.

Definition 2.2.

Two splits and are called compatible if at least one of the four pairwise intersections of their parts is empty. Two splits that are not compatible are called incompatible.

It is easy to see that one of these intersections being empty implies that the other three are non-empty. Compatibility of different splits and is equivalent to the existence of a tree such that the removal of one edge of gives , and the removal of another gives .
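As an illustration of Definition 2.2, a minimal Python sketch of the compatibility test follows. The representation of a split as a pair of Python sets and the function name `compatible` are assumptions made here for illustration, not notation from the paper.

```python
def compatible(split1, split2):
    """Return True if some part of split1 is disjoint from some part of split2.

    Each split is a pair (A, B) of disjoint leaf sets on the same ground set.
    """
    a, b = (set(p) for p in split1)
    c, d = (set(p) for p in split2)
    # Compatible exactly when one of the four pairwise intersections is empty.
    return any(not (x & y) for x in (a, b) for y in (c, d))


s1 = ({1, 2}, {3, 4, 5})
s2 = ({1, 2, 3}, {4, 5})
s3 = ({1, 3}, {2, 4, 5})
print(compatible(s1, s2))  # True: {1, 2} and {4, 5} are disjoint
print(compatible(s1, s3))  # False: all four intersections are non-empty
```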

In fact there is a deep duality between phylogenetic trees and split sets: given a set of different splits on leaf set which are pairwise compatible, and weights for each, there is a unique phylogenetic tree realizing them (Buneman et al., 1971 [8]). Conversely, for a phylogenetic tree , the collection of all splits (one for each edge ) is pairwise compatible. A phylogenetic tree with $n$ leaves contains at most $2n-3$ splits, and at most $n-3$ thick splits. It will be very useful for us to have this structural equivalence between a phylogenetic tree , and the split set which defines its topology. We will alternately refer to an edge and the partition it induces; for both, the weight is denoted .

If the external (leaf) edges of are also endowed with weights, then is equivalent to an additive metric space, whose points are leaves with the weighted path metric on . This is discussed further in Section 2.4.

2.2. Tree Space

For a fixed leaf set and a set of compatible thick splits on , there exists a unique tree topology realizing , as discussed in the previous section. We can then organize the set of all phylogenetic trees with this topology by their weight sets, ordered lexicographically by the corresponding split of each weight, in a space isometric to . We can include the boundary, by allowing weights to be 0, and this gives us , which is called an orthant. Maximal orthants have dimension . This is illustrated in Figure 2. The norm of a tree is the norm of the vector of its split weights. We will denote the lowest-dimensional orthant containing tree by , and the lowest-dimensional orthant containing all trees with exactly the splits by . Conversely, the set of splits contained in all trees in the interior of orthant is denoted by .

Figure 2. Left, a tree with 6 leaves, splits , , and with weights , , and , respectively. Right, the point in the orthant of representing . The cone point is shown at the origin.

If two sets of compatible thick splits, and , have splits in common, , then the orthants corresponding to and each have a boundary orthant that contains the same trees. We identify all such common boundary orthants to produce a single space, called the Billera-Holmes-Vogtmann (BHV) treespace and denoted , where is the leaf-set of all trees. When , we will alternatively write for the space. The empty split set produces a single point, called the cone point, , which represents the unique star-shaped tree with no internal edges. The cone point is contained in each orthant at the origin, so the identified space is path-connected. We define the distance between points and in this space to be the infimum of the lengths of all piecewise smooth paths from to , where path length is calculated by summing the distances of the path restricted to each orthant it passes through. The norm , via the straight line path from the origin to the point in the orthant containing .

The BHV treespace was first proposed by Billera, Holmes, and Vogtmann in [5], where they showed that it is a contractible, complete, and globally non-positively curved, or CAT(0), cube complex. Global non-positive curvature implies that there is a unique shortest path, or geodesic, between each pair of trees in the space. There exists a polynomial time algorithm to calculate this path and its length, given by Owen and Provan in [20].

For the purposes of this paper, we will have to keep track of the weights of edges ending in leaves as well, but since all trees in have the same leaves, and therefore the same leaf partitions, we can represent these globally with non-negative coordinates , and define tree space with this product

In this case, the cone point is the tree with no edges and all leaves identified into a single point. Importantly, has all of the important features of : it remains connected, globally non-positively curved, and contractible. As above, when , we may alternatively write for the space. The distance between trees can also be computed by a version of the algorithm of Owen and Provan [20].

2.3. Link graph

Definition 2.3.

The link of the cone point 0 is the set of all trees in which have internal edge lengths summing to 1. Homeomorphically, is the set of trees in at fixed distance from 0.

Because is a cube complex, is a simplicial complex; the face maps are restrictions of face maps of the cube complex, and every -face of the cube complex intersects the link in a -simplex. In particular, the 0-simplices correspond to splits of length 1, the 1-simplices correspond to compatible split pairs, and -simplices correspond to trees sharing the same non-zero splits which have edge lengths summing to 1.

can then be expressed as a cone on based at 0 (hence the name “cone point”), with the cone dimension parametrizing magnitude. Denote the 1-skeleton of the link . The global non-positive curvature condition on gives that is a flag complex, meaning that each -clique in bounds a -simplex in , which corresponds uniquely to the orthant of dimension spanned by the splits. This means that is recoverable from , which together encode all of the non-linearity of . In [2], and in the algorithm presented in section 3.3, is used to calculate the (combinatorial) extension objects and .

2.4. Tree dimensionality reduction

A weighted graph, endowed with the shortest path metric, is a metric space whose underlying set is the vertices of the graph. Acyclic graphs have unique geodesics, and so a metric tree with leaves can be equivalently considered as a metric on the set of leaves, with distance between two leaves given by the length of the unique path between them. A metric which arises from a tree in this way is called an additive metric, and satisfies the four point condition:

$$d(i,j) + d(k,l) \le \max\{\, d(i,k) + d(j,l),\ d(i,l) + d(j,k) \,\}$$

for all leaves $i, j, k, l$.

The four point condition is also sufficient to determine additivity, which in turn implies the existence of a unique tree realizing this metric [8]. The additive distance matrix of a tree with leaf-set is denoted and is an matrix where the -th entry is , the distance between leaves and in tree .
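A minimal sketch of a four point check on a distance matrix is given here for illustration. It uses the standard equivalent formulation that, for every four distinct leaves, the two largest of the three pairwise sums coincide. The function name `is_additive`, the tolerance `tol`, and the example matrices are assumptions of this sketch, and the input is assumed to already be a symmetric metric.

```python
from itertools import combinations


def is_additive(d, tol=1e-9):
    """Check the four point condition on all quadruples of distinct leaves.

    d is a square, symmetric matrix (list of lists) of pairwise distances.
    """
    n = len(d)
    for i, j, k, l in combinations(range(n), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        # The two largest sums must be equal (up to tolerance).
        if sums[2] - sums[1] > tol:
            return False
    return True


# Distances of a 4-leaf tree with unit leaf edges and a unit interior edge.
tree_like = [[0, 2, 3, 3],
             [2, 0, 3, 3],
             [3, 3, 0, 2],
             [3, 3, 2, 0]]
# Same matrix with one distance enlarged, which breaks additivity.
not_tree_like = [[0, 2, 3, 3],
                 [2, 0, 3, 3],
                 [3, 3, 0, 5],
                 [3, 3, 5, 0]]
print(is_additive(tree_like), is_additive(not_tree_like))  # True False
```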

A subspace of an additive metric space is additive, and additive subspaces can be seen as forming subtrees. Tree dimensionality reduction (TDR), as defined in [30], is a method of generating the tree for a subspace of an additive metric space from the original metric tree, and for a more general class of metric spaces called “nearly” additive. In this paper we will deal exclusively with additive metric spaces.

Definition 2.4.

Let be a tree with leaf set , and let . The tree dimensionality reduction map is the map sending to the induced subtree spanned by the leaves , where the induced subtree contains the vertices and edges on the shortest paths through between the leaves in , with each resulting degree 2 vertex and its incident edges with lengths and respectively, being replaced by a single edge with length . We refer to this process as concatenation of and .

We will also consider just the combinatorial reductions of splits, which we will refer to as projections, and which simply remove some of the leaves from one or both partitions of a split. For a split on leafset , the projection onto the leaf-set is the split . Note that one of or may be empty, in which case we would then discard this split. Since the tree dimensionality map operating on tree has the effect of projecting all splits onto the leaf-set , we will abuse notation and use to represent this combinatorial projection.
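The combinatorial projection of a single split can be sketched as follows. The encoding of a split as a `frozenset` of two `frozenset` parts and the names `mk` and `project_split` are illustrative assumptions; a discarded split is signalled by `None`.

```python
def mk(a, b):
    """Build an unordered split from its two parts."""
    return frozenset({frozenset(a), frozenset(b)})


def project_split(split, keep):
    """Project a split onto the leaf subset `keep`.

    Returns None when one side becomes empty, i.e. the split is discarded.
    """
    keep = frozenset(keep)
    parts = [p & keep for p in split]
    return frozenset(parts) if all(parts) else None


keep = {1, 2, 3, 4}
print(project_split(mk({1, 2, 5}, {3, 4}), keep) == mk({1, 2}, {3, 4}))  # True
print(project_split(mk({5}, {1, 2, 3, 4}), keep))  # None
```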

The following result states that the dimensionality reduction should also give the tree which will be constructed from partial information.

Proposition 2.5 ([30, Proposition 4.4]).

Let be a tree with leaf set , and additive distance matrix . Let , and define to be the submatrix of with rows and columns indexed by . Then .

Note that this formulation implies that certain dimension reductions act like projections: if , then on .
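The content of Proposition 2.5 can be checked numerically on a small example. The sketch below assumes a tree is stored as a dictionary from splits (leaf edges included) to weights, computes leaf-to-leaf distances as the total weight of the splits separating the two leaves, and performs the reduction by projecting splits and adding the weights of splits that project to the same split (concatenation). All names and the example weights are hypothetical.

```python
from itertools import combinations


def mk(a, b):
    return frozenset({frozenset(a), frozenset(b)})


def dist(tree, i, j):
    """Path distance: sum the weights of all splits separating i from j."""
    total = 0.0
    for split, w in tree.items():
        part = next(iter(split))
        if (i in part) != (j in part):
            total += w
    return total


def reduce_tree(tree, keep):
    """Project every split onto `keep`; merged splits have their weights added."""
    keep = frozenset(keep)
    out = {}
    for split, w in tree.items():
        parts = [p & keep for p in split]
        if all(parts):  # drop splits with an empty side
            key = frozenset(parts)
            out[key] = out.get(key, 0.0) + w
    return out


leaves = {1, 2, 3, 4, 5}
leaf_w = {1: 0.5, 2: 0.5, 3: 1.0, 4: 0.5, 5: 0.25}
tree = {mk({i}, leaves - {i}): w for i, w in leaf_w.items()}
tree[mk({1, 2}, {3, 4, 5})] = 1.0
tree[mk({1, 2, 3}, {4, 5})] = 2.0

keep = {1, 2, 3, 4}
reduced = reduce_tree(tree, keep)
# The leaf edge of 4 and the interior edge next to it are concatenated.
print(reduced[mk({4}, {1, 2, 3})])  # 2.5
agree = all(abs(dist(tree, i, j) - dist(reduced, i, j)) < 1e-12
            for i, j in combinations(sorted(keep), 2))
print(agree)  # True
```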

Example 2.6.

Starting with the tree on the left in Figure 3, tree dimensionality reduction to the leaf set is performed by first pruning the 5th leaf and its leaf edge, which gives the center tree. This tree has a degree 2 vertex, in red, which is removed, its boundary edges concatenated, to produce the final tree on the right.

Figure 3. Left, a tree with 5 leaves. Center, the tree with leaf 5 and its edge deleted, resulting in a degree two vertex (in red). Right, the tree after concatenating the two edges adjacent to the degree two vertex.

3. The Pre-Image of the Tree Dimensionality Reduction Map

The aim of this section will be to algorithmically construct the preimage of the tree dimensionality reduction map , for , . We start with a binary tree with edge lengths for , and want to describe and compute the set of all trees such that . Since by Proposition 2.5 the distance of the leaves to each other and to the leaves does not affect the distance between the leaves , many different tree topologies can map to under . Thus it is not immediately obvious how this set should be described.

As this section demonstrates, one effective approach is to:

  1. Note that for any , the topology of the image is completely determined by the topology of , and acts linearly on the edge weights in the orthant in . Thus, for a fixed maximal orthant of , restricts to a linear map . Any non-maximal orthant is on the boundary of at least three maximal orthants, and the linear map of any of these maximal orthants can be used.

  2. Find the orthants with a topology such that has the same topology as . By Proposition 3.6, these orthants can be determined by individual and pairwise properties of their splits, a surprising result.

  3. For a fixed orthant , form the matrix which encodes the way the edges of trees in concatenate under .

  4. Find the positive solutions of the linear system of equations , where is the vector of edge weights in , to determine the points such that when is performed, all of the edges of which concatenate to form an edge have weights summing to .

  5. Take the union of all of the orthant-wise solutions, and call this the extension space .

We will show that , and that the resulting space is connected, continuous, piecewise linear, of local dimension , and computable in cubic time relative to its size. We call the above algorithm the extension algorithm.
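Steps 3 and 4 can be illustrated for one fixed orthant with a small sketch. It is one possible reading of the construction, not the paper's implementation: the matrix has a row for each split of the small tree and a column for each split of the extended topology, with entry 1 when the column split projects to the row split. The names `concat_matrix` and `apply`, and the example weights, are assumptions.

```python
def mk(a, b):
    return frozenset({frozenset(a), frozenset(b)})


def project(split, keep):
    parts = [p & keep for p in split]
    return frozenset(parts) if all(parts) else None


def concat_matrix(small, big, keep):
    """0/1 matrix: entry (r, c) is 1 if big[c] projects onto small[r]."""
    keep = frozenset(keep)
    return [[1 if project(b, keep) == s else 0 for b in big] for s in small]


def apply(matrix, vec):
    """Plain matrix-vector product."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


keep = {1, 2, 3, 4}
full = keep | {5}
# Small tree: four leaf edges of length 1 and the interior split 12|34 of length 2.
small = [mk({i}, keep - {i}) for i in sorted(keep)] + [mk({1, 2}, {3, 4})]
w = [1.0, 1.0, 1.0, 1.0, 2.0]
# Extended topology: leaf 5 attached to the interior edge.
big = ([mk({i}, full - {i}) for i in sorted(full)]
       + [mk({1, 2}, {3, 4, 5}), mk({1, 2, 5}, {3, 4})])

M = concat_matrix(small, big, keep)
# A candidate point of the orthant: the two pieces of the interior edge sum to 2.
w_big = [1.0, 1.0, 1.0, 1.0, 0.25, 0.5, 1.5]
print(apply(M, w_big) == w)  # True
```

The column of the new leaf edge is zero, which reflects that its length is a free parameter in the solution set.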

Note that we will assume that is binary, since an unresolved tree is often used in biology when the underlying relationship of certain leaves or subtrees is not known. In such cases, the edge lengths near the unresolved vertex would not necessarily represent the expected length of their corresponding split in the true tree, which is the main assumption of this paper. Thus we focus on binary trees in this paper, and leave incorporating unresolved trees into this framework for future work.

3.1. Extension by one leaf

To give some intuition for how the extension space relates to the original tree, and to show the mechanics of the base case for later results, we first examine the case where . This means finding the set of trees which have one additional leaf, labeled .

Definition 3.1.

Let be the tree dimensionality reduction map which deletes leaf and its adjacent edge, and concatenates the two edges at leaf ’s attachment point. We will refer to this reduction as an -pruning.

The reverse of pruning a leaf is attaching a new leaf to the tree with a new edge. We call this attachment operation grafting.

Definition 3.2.

For a tree , the tree is a -grafting of if , and .

In other words, a grafting of consists of a tree identical to , but with one additional leaf and its leaf edge . In considering the possibilities for such a grafting, there are two independent choices: the non-negative length of , and a point on at which to graft the non-leaf end. The next lemma shows the consequences of this, and a bit more.

Lemma 3.3.

For tree and leaf , the space of -graftings of , denoted , is the direct product of and a piecewise-linear connected curve which is graph-isomorphic to and which intersects a strict subset of orthants each in a 1-dimensional linear curve.

Proof.

Consider any tree , leaf and length . Recall that is the set of edges of tree , with each edge having split and length .

We can attach a new edge of length ending in leaf to any point, including an endpoint, on any edge of to get a -grafting of . Thus the set of -graftings of , , is not empty. For any , its additive metric restricted to the leaves is just the additive metric of , . It follows can be completely characterized by two independent choices: the choice of point on for grafting, the space of which is graph-isomorphic to , and a choice of length for the grafted leaf edge, which can be any non-negative real number.

Let be the edge to which , which has split , will be grafted to form . If we are grafting to a vertex of , then choose to be one of the edges adjacent to this vertex. For each edge , the two partitions of the leaves in the corresponding split induce two subtrees of , and edge is completely contained in one of these subtrees. Add leaf to the partition of corresponding to this subtree to get , the corresponding split in . The split becomes the splits and in . If was grafted to an endpoint of , then one of will have zero weight, but we will still include it here as a split for consistency. Thus has precisely the splits .

For each edge , the weight of split in is the same as the weight of split in , since the edge corresponding to projects to the edge corresponding to without distortion. Thus, we will represent the weight of edge in by as well. Split has weight , and let splits and have weights and , respectively. Then the space of all formed by grafting leaf to edge is a two-parameter family satisfying , and . Note that is a free parameter, and is the equation of a line. Thus this solution space in this orthant is the direct product of with the line that intersects the orthant boundaries at and at .

It remains to show that the lines given by in each orthant are connected and graph isomorphic to tree . Let and be two adjacent edges in , separated by vertex . Edges and are compatible because they exist in the same tree, and thus the intersection of one partition from each split is empty. Without loss of generality (by temporarily renaming the partitions if necessary), assume that . Then the case corresponds to a tree with splits , with weight , and , with weight , as well as splits , with weight , for all , and , with weight . The case corresponds to a tree with splits , with weight , and , with weight , as well as splits , with weight , for all , and , with weight . But these are identical split and weight sets, and thus the two line endpoints coincide. Since two of these line segments meet if and only if they correspond to attaching leaf to adjacent edges in , we get that the piecewise-linear connected curve is graph-isomorphic to . ∎
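The construction in the proof can be sketched in code. The function below grafts a new leaf onto a chosen edge of a tree stored as a dictionary from splits to weights: every other split receives the new leaf on the side containing the chosen edge, and the chosen split is replaced by two splits whose weights sum to the original weight. The names `graft` and `dist`, the parameter `t` (length of the piece on the first side), and the example weights are assumptions of this sketch.

```python
from itertools import combinations


def mk(a, b):
    return frozenset({frozenset(a), frozenset(b)})


def dist(tree, i, j):
    """Sum of the weights of the splits separating leaves i and j."""
    return sum(w for s, w in tree.items()
               if (i in next(iter(s))) != (j in next(iter(s))))


def graft(tree, a, b, new_leaf, t, leaf_len):
    """Graft new_leaf onto edge a|b, at distance t from the a side."""
    a, b, new = frozenset(a), frozenset(b), frozenset({new_leaf})
    edge = frozenset({a, b})
    w_e = tree[edge]
    assert 0 <= t <= w_e and leaf_len >= 0
    out = {}
    for split, w in tree.items():
        if split == edge:
            continue
        c, d = tuple(split)
        # The part lying wholly on one side of the edge is the far side;
        # the other part contains the edge and receives the new leaf.
        far, near = (c, d) if (c <= a or c <= b) else (d, c)
        out[frozenset({far, near | new})] = w
    out[frozenset({a, b | new})] = t        # piece of the edge next to a
    out[frozenset({a | new, b})] = w_e - t  # piece of the edge next to b
    out[frozenset({new, a | b})] = leaf_len
    return out


leaves = {1, 2, 3, 4}
tree = {mk({i}, leaves - {i}): 1.0 for i in leaves}
tree[mk({1, 2}, {3, 4})] = 2.0

g = graft(tree, {1, 2}, {3, 4}, 5, t=0.5, leaf_len=0.25)
unchanged = all(dist(g, i, j) == dist(tree, i, j)
                for i, j in combinations(sorted(leaves), 2))
print(unchanged)      # True
print(dist(g, 5, 1))  # 1.75
```

Varying `t` over the interval from 0 to the edge weight traces the line segment in the corresponding orthant, and `leaf_len` is the free non-negative parameter.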

Example 3.4.

Suppose we have a tree with labels as depicted in Figure 4, with leaf edges having length respectively, and interior edge length . The corresponding additive distance matrix (indexed respectively) is given by

Then the preimage of is the product of the subspace of depicted on the right in Figure 4 (with leaf edge length for determined uniquely by the point on below) and the copy of (not shown) representing the “4”-leaf edge length. If we fix the length of the 4 leaf, the -grafting of is the subspace shown by a thick line, together with unique local leaf coordinates

where are the weights of splits , respectively, if that split exists in the tree, and 0 otherwise.

While it may appear that the four line segments corresponding to grafting to a leaf edge end mid-orthant, this is only because the figure omits the dimensions of those orthants corresponding to the leaf edges. The line segments end on boundaries where the respective leaf edge lengths are 0.

Figure 4. Left, a tree with 4 leaves, . Right, the orthants of containing the preimage , with the subspace corresponding to the preimage shown with the thick solid lines. Note that the dimensions corresponding to the 4 leaf edge lengths were not included for clarity.

3.2. Extension by Multiple Leaves

As defined in [2], the connection cluster of a tree topology on leaf set is the set of binary tree topologies with leaves obtained from adding leaves to arbitrary edges of . We will generalize the definition of a connection cluster to allow the leafset of to be any subset of , and use the notation , where and . Throughout this section, we will still assume that , and . The connection space in the notation of [2], or in our notation, is the union of the closed orthants in that represent the elements of , i.e. a non-negative real orthant for every unweighted tree in under the normal identification of faces. The connection graph , or with a change of notation, , is the intersection of with the link , in which maximal cliques give elements of . In [2] and Lemma 3.7 below, it is shown that the edges of a connection graph are determined by normal pairwise compatibility of splits in , which allows for quick computation of .

The connection space can also be seen as the preimage in under of the entire orthant represented by , namely Similarly, the connection graph is the corresponding preimage of the complete -graph on . We are then interested in the subspace of , restricted by the edge lengths of , which projects under tree dimensionality reduction to . This subspace will be a -dimensional linear submanifold supported in . In other words, once the combinatorics of the extended trees are calculated through the connection cluster, we can use a set of linear equations parametrized by the edge lengths in to constrain sums of fixed edges in , and give the complete preimage .

3.3. Calculating the Metric Extension Space

In this section we will construct, for phylogenetic tree , the subset which results from gluing leaves of arbitrary length to the metric tree . The computation of the extension space has two steps:

The first step is the computation of , via the method in [2] for constructing and . We will see that this is the preimage under of the orthant containing .

The second step introduces the constraint that under the action of on , the process of deleting and concatenating edge lengths as described in Definition 2.4 yields precisely. To find the trees which satisfy this constraint, we solve a system of linear equations separately for each orthant in .

3.3.1. Combinatorial Step

As in the previous section, we let be the splits of (including the leaf edges), with corresponding lengths . We will first state the algorithm for computing the connection cluster and give an example, before proving correctness.

Computation of Connection Cluster

  1. For each , construct the set of splits projecting to by adding the labels to or in all possible ways.

  2. Take the union to get the vertices of the connection graph . Add an edge between each pair of vertices if and only if the two splits are compatible, which can be checked by the condition given in Definition 2.2.

  3. Find all maximal () cliques in the subgraph of thick partitions, which is found by removing the leaf splits, . Extend each maximal clique to include the leaf partitions, which are compatible with all other partitions, and return the corresponding set of cliques .
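The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: splits are encoded as unordered pairs of label sets, the helper names (`make_split`, `preimages`, `connection_cluster`) are hypothetical, the compatibility test is the usual empty-intersection condition, and cliques are found with a plain Bron-Kerbosch recursion rather than the method of [23]. The small tree used at the end (leaves 0 to 3 with one interior split 12|03, extended by a leaf 4) is chosen only for illustration, and the sketch is exercised only on this single-leaf extension.

```python
from itertools import combinations

def make_split(a, b):
    """Encode the split a|b as an unordered pair of label sets."""
    return frozenset((frozenset(a), frozenset(b)))

def compatible(s, t):
    """Splits are compatible when some side of s is disjoint from some side of t."""
    return any(not (x & y) for x in s for y in t)

def preimages(s, new_labels):
    """Step 1: distribute the new labels over the two sides in all possible ways."""
    a, b = tuple(s)
    new_labels = frozenset(new_labels)
    return {make_split(a | frozenset(c), b | (new_labels - frozenset(c)))
            for r in range(len(new_labels) + 1)
            for c in combinations(sorted(new_labels), r)}

def maximal_cliques(vertices, adj):
    """Bron-Kerbosch enumeration without pivoting (adequate for small graphs)."""
    found = []
    def expand(r, p, x):
        if not p and not x:
            found.append(r)
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p, x = p - {v}, x | {v}
    expand(frozenset(), frozenset(vertices), frozenset())
    return found

def connection_cluster(tree_splits, new_labels):
    labels = frozenset().union(*(side for s in tree_splits for side in s))
    labels = labels | frozenset(new_labels)
    # Steps 1 and 2: vertices are all preimages; edges join compatible pairs.
    vertices = set().union(*(preimages(s, new_labels) for s in tree_splits))
    # Step 3: restrict to thick splits (both sides have at least two labels).
    thick = {v for v in vertices if all(len(side) >= 2 for side in v)}
    adj = {v: {u for u in thick if u != v and compatible(u, v)} for v in thick}
    leaf_splits = frozenset(make_split({x}, labels - {x}) for x in labels)
    # Assumption: a binary tree on k labels has k - 3 interior splits.
    size = len(labels) - 3
    return [c | leaf_splits for c in maximal_cliques(thick, adj) if len(c) == size]

# Hypothetical input: leaf splits of {0,1,2,3} plus the interior split 12|03.
tree_splits = [make_split({0}, {1, 2, 3}), make_split({1}, {0, 2, 3}),
               make_split({2}, {0, 1, 3}), make_split({3}, {0, 1, 2}),
               make_split({1, 2}, {0, 3})]
cluster = connection_cluster(tree_splits, {4})
```

For this input the thick subgraph has six vertices and five edges, one for each edge of the original tree to which leaf 4 can be grafted, so `cluster` contains five topologies, each with five leaf splits and two interior splits.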

Example 3.5.

Returning to the tree in Example 3.4, we find using the above algorithm. The set of splits , so in Step 1, we find the set

In the second step, we form the graph , which is shown in Figure 5.

Figure 5. The connection graph for tree from Example 3.4. The vertices corresponding to elements of are labeled by the smaller of the two pieces of the partition. The leaf partitions have automatic compatibility - these edges are shown dotted, while compatible thick partitions have colored edges.

In Step 3, we find maximal -cliques in the thick subgraph. The -cliques are edges, and for each edge, we can add all of the leaf edges to that set to obtain a unique topology of . All such topologies form the connection cluster . The orthants corresponding to these topologies are precisely those pictured in Example 3.4, and form , the connection space, which is shown again in Figure 6 without the leaf dimensions.

Figure 6. The connection space for tree from Example 3.4.

The proposition below shows that the set of cliques returned in the final step of the algorithm is indeed the connection cluster , justifying the notation.

Proposition 3.6.

For with , the above algorithm returns the cliques , which correspond to the orthant support of .

First we show a preliminary result allowing us to reduce to conditions on the vertices of the extension graph.

Lemma 3.7.

For tree with , an orthant contains an element of if and only if . That is, contains a tree in the extension space of if and only if removing the labels from the splits yields precisely the split set of (with multiplicity).

Proof.

We proceed by induction on .

If and is an extension of by grafting leaf to edge , then from the proof of Lemma 3.3, has split set . Recall that removing edge from induces two subtrees, the vertices of which become the two parts of splits , and that was constructed from by adding leaf to the partition corresponding to the subtree to which was grafted. Thus projects to by construction for all . Similarly, and were constructed such that they project onto . Finally projects onto a split with one partition empty, which we delete.

Conversely, if a set of pairwise-compatible splits on projects to under deletion of some leaf , then we claim there exists a unique split which has two preimages. Suppose not. That is, suppose for and splits in , the collective split preimages are , , , and . Then compatibility of and in guarantees that precisely one of is empty, say without loss of generality . Then and are not compatible, because none of the four intersections of their partitions are empty. Thus contains only one of them. So for any pair of splits in , there are at most 3 preimage splits in , and unique splits have distinct preimages, so we conclude that there is a unique split in with both preimages, i.e. the set must look precisely as above, , and from this we can construct uniquely by grafting the -leaf edge to the middle of edge .

So we have the result for the case.

Then assume for induction that there exists such that , if and only if . Then let be an orthant in . So then is an orthant in , and applying the inductive hypothesis, there exists with if and only if . Since from the one-step case, and , giving us the forward direction. For the reverse direction, we know that which means that there is some tree such that by the base case. For this tree, then, , and the proof is complete.

Proof.

(of Proposition) Suppose we have a maximal clique in . Then this clique represents a set of pairwise compatible splits. Since is a flag complex, these splits represent an orthant in , of dimension corresponding to the size of the clique. By Lemma 3.7, these splits project to the splits of , so the orthant contains elements of the extension space.

Conversely, suppose a tree is in the extension space. Then by Lemma 3.7, the splits of are among the vertex set of , and since is a tree in , its splits are compatible. Since this is the condition for connectivity in as well as , maps to a clique in . ∎

Proposition 3.8.

This algorithm is .

Proof.

In the first step, we do a simple enumeration, with run time . The second step of removing duplicates and initializing the graph is then , and checking compatibility is for each pair, so has . By [23], the run time of maximal clique enumeration is , and from [2] we have that the vertex set has size . Since the edge set has size at most the square of this, we have a run time for clique enumeration. This dominates the other steps, which gives the result. ∎

Note that while this is fairly quick in , it may be the case that we have small fragments of large trees, implying a very dominant term. In this case, the algorithm is essentially reconstructing a large portion of , and so there is not much improvement which can be made, since the solution space itself is large. In the next section we will address a method for handling small tree fragments among a set of tree fragments.

3.3.2. Metric Step

Consider an orthant , and index its corresponding splits by (for example, in lexicographical order). By construction, for some . We represent this assignment with a projection matrix , where

Since this is a well-defined map from to , columns each have a unique non-zero entry. We then set up the real system of equations:

(1)

for the vector of non-negative edge weights in ( the weight of split ), and the vector of edge weights in .

To see what this system is producing, notice that this specifies, for each split in with weight , the equation

for projecting to , so that under tree dimensionality reduction , the (non-negative) lengths of the edges of a tree in concatenated to produce edge sum precisely to . So solving this system of equations would find vectors of possible edge lengths in tree topologies which project to .
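As a rough illustration of this system, the sketch below builds a 0/1 projection matrix for one orthant and checks whether a candidate weight vector satisfies the resulting equations. The names `make_split`, `project`, `projection_matrix` and `solves_system`, the example tree, and its edge lengths are all assumptions made for illustration. The leaf split of the added label projects to a split with an empty side, so the sketch drops it from the system and treats its length as unconstrained.

```python
def make_split(a, b):
    """Encode the split a|b as an unordered pair of label sets."""
    return frozenset((frozenset(a), frozenset(b)))

def project(s, keep):
    """Delete every label outside `keep`; return None if a side becomes empty."""
    sides = [side & keep for side in s]
    return None if any(not side for side in sides) else frozenset(sides)

def projection_matrix(orthant, tree_splits, keep):
    """Rows: splits of the small tree. Columns: orthant splits with a nonempty projection."""
    cols = [s for s in orthant if project(s, keep) is not None]
    P = [[1 if project(c, keep) == r else 0 for c in cols] for r in tree_splits]
    return P, cols

def solves_system(P, w, lengths, tol=1e-9):
    """Check w >= 0 and that each row sum of P w equals the matching edge length."""
    if any(x < 0 for x in w):
        return False
    return all(abs(sum(p * x for p, x in zip(row, w)) - ell) <= tol
               for row, ell in zip(P, lengths))

# Hypothetical small tree on {0,1,2,3} with edge lengths listed in the same order.
keep = frozenset({0, 1, 2, 3})
tree_splits = [make_split({0}, {1, 2, 3}), make_split({1}, {0, 2, 3}),
               make_split({2}, {0, 1, 3}), make_split({3}, {0, 1, 2}),
               make_split({1, 2}, {0, 3})]
lengths = [1.0, 2.0, 1.0, 1.0, 0.5]

# Orthant obtained by grafting leaf 4 onto the leaf edge of label 1.
orthant = [make_split({0}, {1, 2, 3, 4}), make_split({1}, {0, 2, 3, 4}),
           make_split({2}, {0, 1, 3, 4}), make_split({3}, {0, 1, 2, 4}),
           make_split({4}, {0, 1, 2, 3}), make_split({1, 4}, {0, 2, 3}),
           make_split({1, 2, 4}, {0, 3})]
P, cols = projection_matrix(orthant, tree_splits, keep)

# The leaf edge of label 1 (length 2.0) is cut into pieces of length 1.5 and 0.5.
w_good = [1.0, 1.5, 1.0, 1.0, 0.5, 0.5]
w_bad = [1.0, 2.0, 1.0, 1.0, 0.5, 0.5]
ok_good = solves_system(P, w_good, lengths)
ok_bad = solves_system(P, w_bad, lengths)
```

In this example the matrix has five rows and six columns, every column has exactly one non-zero entry, and only the row for the leaf edge of label 1 has two. The solution set in this orthant therefore has one free parameter (where along that edge the new leaf attaches), in addition to the unconstrained length of the new leaf edge.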

Definition 3.9.

Given an orthant , which, alternatively, has splits corresponding to a clique in and a topology in , we call the set of satisfying (1) the extension space of in , denoted or . The extension space of in is defined to be the union of extension spaces over all orthants in the connection space:

Note that the image of under tree dimensionality reduction to gives a partition of the set into precisely components, because is well-defined and surjective on ’s. Because it is a partition and , we are guaranteed a solution of dimension to the equation above, and a total solution space of dimension

This generalizes the single leaf extension case in that, after the equations are solved for all orthants, the result is the direct product of a piecewise-linear connected -manifold (intersecting a strict subset of orthants each in an -dimensional linear subspace), with . Connectivity follows from the consideration that if two orthants share a -dimensional face, then that face is represented as a -clique in the connection graph, and the metric extension space meets the face in a set of equations of precisely the same sort on each side.

Proposition 3.10.

For leafset , let be a binary tree. The extension space of , , is connected. Furthermore, for adjacent orthants , .

Proof.

For each orthant , the extension space is connected, since it is the solution of a linear system of equations, restricted to the non-negative orthant. Any two adjacent orthants share some -dimensional boundary orthant, which corresponds to a -clique in the connection graph. Suppose the splits in the clique are . Then any solutions , on the boundary only have non-zero weights for the splits . Furthermore, since the projection of each onto a unique split in does not depend on the orthant, when we remove the 0 weights from each system of equations ( and ), the two systems of equations will now be identical. Therefore the intersection of and is precisely each of their intersections with the boundary orthant . ∎

Example 3.11.

Returning to the tree from Examples 3.4 and 3.5, based on the projection which deletes the label “4”, we set up the following linear system.

Without the leaf dimensions, the portion of the extension space pictured in Example 3.4 is specified by the first equation and the non-negative constraints.

Theorem 3.12.

Let and . Then .

Proof.

By construction and Proposition (3.6), , so for each , i.e. and intersect the same orthant set, given by . Furthermore, the procedure of dimension reduction as given in Definition 2.4 guarantees that each edge