A commonly used hypothesis in evolutionary biology is the molecular clock. For example, a strict molecular clock is the assumption that the mutation rate of a gene is approximately constant over time. After this phenomenon was first observed by zuckerkandl1965evolutionary, the molecular clock became a popular hypothesis, and various relaxations were developed Kumar2016-eu. A popular framework for reconstructing and analysing timed evolutionary (species) trees Kingman1982-df that uses the molecular clock assumption on gene trees is coalescent theory. For example, the coalescent is widely employed for inferring relationships of a sample of genes Hudson1990-ki, Kuhner2009-jb, or for analysing population dynamics Kuhner1998-eh, Drummond2005-ak. A recent striking application of coalescent theory is cancer phylogenetics Posada2020-aa, Ohtsuki2017-su, where accurate estimates of divergence times are essential for targeted treatment strategies. Under a coalescent model, evolution is considered backwards in time, and two lineages coalesce after a waiting time, which is to be estimated.
In phylogenetic trees, which display evolutionary relationships, internal nodes can hence be equipped with times when a molecular clock is assumed. Software packages for reconstructing those trees from data such as RNA, DNA, or protein sequences rely on a parameterisation of trees where internal nodes are equipped with times. Popular tree inference packages used for this purpose are based on Maximum Likelihood Kozlov2019-cf, Nguyen2015-sp, Tamura2011-ky or Bayesian methods Bouckaert2014-ir, Suchard2018-tw, Ronquist2003-eq. They rely on tree search algorithms, where in every step a new tree is proposed and accepted if it fulfils certain requirements. For tree proposals under the molecular clock assumption, a parameterisation of trees that takes the times of internal nodes into account is required. Furthermore, a similarity measure for these trees is necessary in order to propose trees that are measurably similar to the current tree.
Tree spaces that take branch lengths of trees into account exist in the literature. For example, the BHV-space Billera2001-rj models trees as points in a cubical complex. However, this parameterisation is not suitable for coalescent trees because changing the times of an evolutionary event in the tree implies that all preceding events change their times as well. Hence two trees can be close to each other in this space even though the timing of many internal nodes is different in the two trees. Examples of more suitable tree spaces where internal nodes of trees are equipped with times are -space and -space Gavryushkin2016-uu. It has been observed, however, that in the -space, similarly to the BHV-space, shortest paths between trees often contain the star tree Gavryushkin2016-uu, a property that can be problematic in applications. Although the -space is free from these properties, no algorithm for computing distances or shortest paths between trees in this space is known yet, so applications are limited.
Enabling statistical analysis over the space of phylogenetic trees was an important motivation for Billera2001-rj to introduce the BHV-space and study its geometric properties. Tree space geometry has also played an important role in studies of rogue taxa in a tree Cueto2011-bh and also summary trees Miller2015-rk. Here, driven by the same motivation, we propose to study coalescent trees.
In this paper we introduce the space of discrete coalescent trees, where internal nodes are assigned unique discrete times. This tree space is a discrete version of the -space and a generalisation of the ranked nearest neighbour interchange (RNNI) space Collienne2020-iu. Here we show that the space as well as have the desired properties mentioned above, including efficiently computable shortest paths that preserve biological information shared between trees. After introducing notations used throughout this paper (Section 2), we discuss how the algorithm FindPath Collienne2020-iu can be generalised from to be applied to discrete coalescent trees, computing shortest paths in polynomial time (Section 3). We then analyse some geometrical properties of both tree spaces and (Section 4). First, we discuss the cluster property in Section 4.1 and then consider a subset of trees (caterpillar trees) for which we are able to compute distances more efficiently than with FindPath (Section 4.2). Following that, we establish the diameter of and and briefly discuss the radius for each space. We finish this paper with a section providing a connection between the space and partition lattices, and propose directions for further research (Section 5).
2. Technical Introduction
A rooted binary phylogenetic tree is a binary tree with leaves uniquely labelled by elements of a set . The main objects of study in this paper are discrete coalescent trees: binary rooted phylogenetic trees with a positive integer-valued time assigned to each node. More specifically, all leaves are assigned time , and every internal node is assigned a unique time less than or equal to an integer , such that every internal node has a time greater than those of its children. Note that this implies . We denote the time of an internal node by . If not stated otherwise, we refer to discrete coalescent trees simply as trees. We furthermore call two trees (not necessarily binary) identical if there is a graph isomorphism between them preserving leaf labels and times.
As a special case of discrete coalescent trees we consider ranked trees with root time . In these trees internal nodes have distinct times ranging from to . This definition of ranked trees coincides with the one of Collienne2020-iu. In the case of ranked trees we say rank of a node to mean its time () to be consistent with notations used in Collienne2020-iu. There are ranked trees Semple2003-nj. Every ranked tree gives discrete coalescent trees, as every -element subset of can be the set of times assigned to the internal nodes of a ranked tree. Hence there are, contrary to the claim in Gavryushkin2018-ol, discrete coalescent trees.
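The counting argument above can be checked numerically. The sketch below assumes the standard closed form n!(n-1)!/2^(n-1) for the number of ranked trees on n labelled leaves (the formula itself is not legible in the text above, so it is stated here as an assumption) and multiplies it by the number of (n-1)-element subsets of {1, ..., m}, as described in the paragraph. The function names are hypothetical.

```python
from math import comb, factorial


def count_ranked_trees(n: int) -> int:
    """Number of ranked trees on n labelled leaves.

    Assumes the standard closed form n! * (n-1)! / 2^(n-1).
    """
    if n < 2:
        raise ValueError("a tree needs at least two leaves")
    # The quotient is always an integer (it equals the product of C(k, 2) for k = 2..n).
    return factorial(n) * factorial(n - 1) // 2 ** (n - 1)


def count_discrete_coalescent_trees(n: int, m: int) -> int:
    """Number of discrete coalescent trees on n leaves with root time at most m.

    Each ranked tree is combined with every (n-1)-element subset of {1, ..., m}
    used as the set of internal node times.
    """
    return comb(m, n - 1) * count_ranked_trees(n)


if __name__ == "__main__":
    for n, m in [(3, 2), (3, 3), (4, 5)]:
        print(n, m, count_discrete_coalescent_trees(n, m))
```

For m = n - 1 the binomial factor is one, so the count reduces to the number of ranked trees; for m < n - 1 it is zero, matching the requirement that there be enough distinct times for all internal nodes.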
Every internal node of a tree can be referred to by the set of leaves that are descending from this node. We call such a set a cluster and say that the cluster is induced by . A list of clusters determines at most one ranked tree Collienne2020-iu, where cluster is induced by the internal node with rank for . For discrete coalescent trees, however, times of nodes also need to be provided to uniquely identify a tree. For a subset we call the internal node of a tree with lowest time among those ancestral to all elements of the most recent common ancestor of and denote it by . We furthermore denote the parent of a leaf in by , and the cluster induced by the node with time in by . The node highlighted in Figure 1, for example, can be referred to as , the parent of , or , the most recent common ancestor of , or , the node with time three in . Note that we will simply write or to mean or , respectively. Although differing from traditional notations, our notation with brackets referring to internal nodes is intuitive, shortens nested formulas, and is consistent with notations used in Collienne2020-iu. A type of tree that will be of importance throughout the whole paper is the caterpillar tree, a tree where every internal node has at least one child that is a leaf.
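The notions of cluster, most recent common ancestor, parent of a leaf, and caterpillar tree can be made concrete with a small sketch. It assumes a hypothetical representation in which a tree is a dictionary mapping the time of each internal node to the cluster that node induces; the representation and all function names are illustrative and not taken from the paper.

```python
Tree = dict[int, frozenset[str]]  # hypothetical: internal node time -> induced cluster


def mrca_time(tree: Tree, leaves: set[str]) -> int:
    """Time of the lowest internal node whose cluster contains all given leaves."""
    target = frozenset(leaves)
    candidates = [time for time, cluster in tree.items() if target <= cluster]
    if not candidates:
        raise ValueError("leaves are not all contained in the tree")
    return min(candidates)


def parent_time(tree: Tree, leaf: str) -> int:
    """Time of the parent of a leaf: the lowest internal node above that leaf."""
    return mrca_time(tree, {leaf})


def is_caterpillar(tree: Tree) -> bool:
    """A binary tree is a caterpillar iff its clusters, ordered by time,
    form a chain that starts with two leaves and grows by one leaf per node."""
    clusters = [tree[time] for time in sorted(tree)]
    if len(clusters[0]) != 2:
        return False
    return all(
        lower < upper and len(upper) == len(lower) + 1
        for lower, upper in zip(clusters, clusters[1:])
    )


if __name__ == "__main__":
    caterpillar = {
        1: frozenset({"a1", "a2"}),
        3: frozenset({"a1", "a2", "a3"}),
        4: frozenset({"a1", "a2", "a3", "a4"}),
    }
    print(mrca_time(caterpillar, {"a2", "a3"}))  # time of the node inducing {a1, a2, a3}
    print(is_caterpillar(caterpillar))
```

Storing times as dictionary keys reflects the requirement that internal node times are unique; a ranked tree is the special case where the keys are exactly 1, ..., n-1.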
We are now ready to introduce the central object of study of this paper, the graph (or space) of discrete coalescent trees. This graph is called for a fixed positive integer . The vertex set of is the set of trees with root time less than or equal to . Note that a second parameter of is the number of leaves of the trees in the graph, which we assume to be fixed throughout this paper. Trees and are connected by an edge ( and are neighbours) in this graph if performing one of the following (reversible) operations on results in (Figure 2):
An NNI move connects trees and if there is an edge in and an edge in , both of length one, such that shrinking and to nodes results in identical trees.
A rank move on exchanges the times of two internal nodes with time difference one.
A length move on changes the time of an internal node by one.
A length move can only change the time of a node to become if there is no node with time already. Furthermore, the time of the root of a tree in cannot be changed by a length move to become greater than in . Note that our definition of differs from the definition of the space on discrete time-trees of Gavryushkin2018-ol. In contrast to their definition, length moves in do not change the height of a tree unless they are performed on the root, which makes our definition appropriate for coalescent trees.
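The constraints on length moves can be sketched in code. The example assumes the same hypothetical dictionary representation (internal node time mapped to induced cluster) and a hypothetical function name; it is an illustration of the rules stated above rather than the authors' implementation. Because times are unique and every node lies strictly above its children, checking that the target time is free and within the allowed range is enough to keep the tree valid.

```python
Tree = dict[int, frozenset[str]]  # hypothetical: internal node time -> induced cluster


def length_move(tree: Tree, time: int, delta: int, m: int) -> Tree:
    """Return a copy of `tree` in which the node at `time` is moved by `delta` (+1 or -1).

    Raises ValueError if the move is not allowed:
    - the target time is already taken by another internal node,
    - the target time is below 1 (leaves are assumed to sit at time 0),
    - the target time exceeds the bound m (this can only happen for the root).
    """
    if delta not in (-1, 1):
        raise ValueError("a length move changes a time by exactly one")
    if time not in tree:
        raise ValueError("no internal node with this time")
    new_time = time + delta
    if new_time < 1 or new_time > m:
        raise ValueError("target time is outside the allowed range")
    if new_time in tree:
        raise ValueError("target time is occupied; this would be a rank or NNI move")
    moved = dict(tree)
    moved[new_time] = moved.pop(time)
    return moved


if __name__ == "__main__":
    tree = {1: frozenset({"a", "b"}), 3: frozenset({"a", "b", "c"})}
    print(sorted(length_move(tree, 3, 1, m=4)))  # root moves up by one
    print(sorted(length_move(tree, 1, 1, m=4)))  # lower node moves into the free slot
```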
The definition of leads to a natural definition of the distance between two trees and in this graph as the length of a shortest path between these trees, denoted by . We also consider the ranked nearest neighbour interchange (RNNI) graph of Collienne2020-iu, which is the graph for , and hence a graph of ranked trees. In this graph length moves are not possible, so we use the term RNNI move to mean either a rank move or an NNI move in order to distinguish these moves from length moves.
3. Computing Shortest Paths in
Shortest paths, and therefore distances, between trees in can be computed with the algorithm FindPath, which was introduced by Collienne2020-iu and has running time quadratic in the number of leaves . As is a special case of for , the question arises whether a modification of this algorithm can also be used to compute shortest paths in . In this section we present a generalisation of FindPath that computes distances between trees in . Before introducing the version of FindPath for , we introduce a way to convert trees in on leaves into ranked trees on leaves, such that the distance between those ranked trees equals their distance in (Theorem 1).
A tree in on leaves can be converted into a ranked tree in with leaves in the following way (Algorithm 1). First add a new root with time that becomes parent of the root of . The other child of this new root becomes the root of a caterpillar tree on leaf set , such that . An example of this extension of a tree to a ranked tree is depicted in Figure 3.
Throughout this paper we denote this extended ranked version of a tree by . Moreover, we denote the subtree of that is identical to by ( for discrete coalescent tree) and the caterpillar subtree on leaf set by .
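The extension of a discrete coalescent tree to a ranked tree can be sketched as follows. Several specifics of the construction are not legible in the text above, so the sketch makes explicit assumptions: the new root receives time m + 1, the internal nodes of the added caterpillar subtree take exactly the times in {1, ..., m} not used by the original tree, and the added leaves carry hypothetical labels x1, x2, ... that do not occur in the original tree. The dictionary representation (time mapped to cluster) and the function name are likewise illustrative.

```python
Tree = dict[int, frozenset[str]]  # hypothetical: internal node time -> induced cluster


def extend_to_ranked_tree(tree: Tree, m: int) -> Tree:
    """Extend a tree with root time at most m to a ranked tree (assumed construction).

    Assumptions: new root at time m + 1; caterpillar nodes fill the unused times
    in 1..m; added leaves are named x1, x2, ... and must not clash with tree labels.
    """
    root_time = max(tree)
    if root_time > m:
        raise ValueError("root time exceeds m")
    unused = [time for time in range(1, m + 1) if time not in tree]
    # A caterpillar with k internal nodes has k + 1 leaves (a single leaf if k = 0).
    extra = [f"x{i}" for i in range(1, len(unused) + 2)]
    ranked = dict(tree)
    for k, time in enumerate(unused):
        ranked[time] = frozenset(extra[: k + 2])
    # The new root joins the original tree and the caterpillar subtree.
    ranked[m + 1] = tree[root_time] | frozenset(extra)
    return ranked


if __name__ == "__main__":
    tree = {2: frozenset({"a", "b"}), 4: frozenset({"a", "b", "c"})}
    for time, cluster in sorted(extend_to_ranked_tree(tree, m=4).items()):
        print(time, sorted(cluster))
```

Under these assumptions every time from 1 to m + 1 is used exactly once, so the result is a ranked tree, and swapping the ranks of an original node and a caterpillar node with adjacent times moves the original node into a previously free time slot, which is the correspondence to length moves described in the following paragraph.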
In the following we distinguish two different types of rank moves. Rank moves between one node of and one node of induce length moves on the subtree in (Figure 3). Therefore, we will refer to such rank moves as rank moves corresponding to length moves. All remaining rank moves will still be called rank moves. Note that the correspondence of rank moves between and to length moves in shows that any path between and in can be interpreted as a path between and in .
After extending both trees and in to ranked trees and on leaves, respectively, we can compute shortest paths between and in , using FindPath. A path computed by FindPath preserves clusters Collienne2020-iu, hence there are no moves in the newly added caterpillar subtree on the leaf set on such a path. The only moves involving internal nodes of this caterpillar subtree are rank moves corresponding to length moves, as described above. Hence the path provides a path between and in , when only considering the subtrees induced by in all trees on , interpreting some rank moves between and as length moves. We denote this path, which results from , by . In Theorem 1 we establish that is indeed a shortest path in . Note that for any given pair of trees and , we always assume to be the maximum root time of these trees and consider a shortest path between them in .
Theorem 1. The path between two discrete coalescent trees and is a shortest path in , where is the maximum root time of and .
Let and be discrete coalescent trees and and their extended ranked versions computed with Algorithm 1, respectively. Any path in from to gives a path of equal length between and in the space on leaves. This is due to the fact that the only moves needed in the subtree to transform it to are rank moves corresponding to length moves, and no other moves. If there was a path between and shorter than , the corresponding path between and in would be shorter than the one computed by FindPath in this space. Since this contradicts the fact that FindPath computes shortest paths in [Theorem 1]Collienne2020-iu, it follows that is a shortest path in . ∎
Theorem 1 shows that FindPath computes a shortest path between two trees in in polynomial time, more specifically in . More details on the running time are discussed in Section 4.3 following Theorem 6. It is not even necessary to convert a given pair of discrete coalescent trees to ranked trees to apply FindPath to them. Instead, we modify FindPath for trees in (Algorithm 2). Iterations of FindPath that consider clusters in the added caterpillar trees are replaced by length moves increasing the time of internal nodes as described in the for loop in Line 11 of Algorithm 2. The benefit of this modified version of the algorithm, compared to using FindPath on the extended ranked versions of the trees, is a reduced use of memory, which is especially of practical relevance for , which is typical in applications.
Note that we do not need the parameter in practice, as the distance between any two trees in is the same as their distance in for any . Therefore, if the distance between two trees is to be computed, we can simply choose to be the maximum root height of the given trees and compute their distance in .
4. Geometrical Properties of
4.1. Cluster Property
A tree space has the cluster property, if all trees on every shortest path between two trees sharing a cluster also contain . This is a desirable property in evolutionary biology applications as trees sharing a cluster or subtree are expected to be closer to each other than to a tree not sharing a cluster with them. This property is also desirable in centroid-based tree sample summary methods. For a given sample of trees containing a common subtree, it is expected that their summary tree also contains this subtree. It is therefore desirable to have a tree space that has the cluster property.
A mathematical motivation for investigating the cluster property in RNNI is its importance in a similar tree space, the nearest neighbour interchange graph (NNI). In the NNI graph, trees have no times and NNI moves are allowed on every edge. Computing distances in NNI is NP-hard Dasgupta2000-xa, and the proof relies on the fact that this tree space does not have the cluster property Li1996-zw. In the RNNI graph, however, distances can be computed in polynomial time using FindPath Collienne2020-iu, which preserves common clusters. The question whether RNNI has the cluster property is hence natural, and will be settled in Theorem 2.
Theorem 2. The RNNI graph has the cluster property.
We assume to the contrary that there are two ranked trees and sharing a cluster and a shortest path between these trees where is not present in every tree. We furthermore assume that there is no pair of trees with a shorter path not containing a shared cluster and distance less than , meaning that and give a minimum counterexample. Because of this minimality assumption on the length of , the first tree following on does not contain . Since must be the only cluster changed by the move between and , all nodes with rank below induce the same clusters in and (Figure 4). We now compare distances and by using properties of FindPath.
First we compare and . All trees on these two paths coincide up to iteration , in which the cluster considered on is . Let denote the tree at this point of the path, meaning that and coincide up to this tree . It follows and .
Now consider to evaluate . As FindPath preserves clusters, is present in every tree on up to and including . The first iteration of FindPath applied to the pair of trees hence considers the cluster , as all clusters induced by nodes below coincide in and . To construct the cluster in , there is just one move needed, which results in the tree , as and are neighbours such that contains and does not (Figure 4). We can therefore conclude that , which contradicts the assumption that is the first tree on a shortest path from to . There is hence no shortest path between and that does not preserve , which proves the cluster property for . ∎
The fact that the slightly modified version of FindPath computes shortest paths in already suggests that shortest paths in and have similar properties. Indeed, the cluster property in follows from Theorem 2.
Corollary 1. The graph has the cluster property.
Assume that there is a shortest path between two trees and in that does not preserve a common cluster. This path corresponds to a path between and , the extended ranked versions of and in , as already discussed in Theorem 1. Since this path has the same length as the one between and , it is a shortest path in as well, which contradicts Theorem 2. ∎
4.2. Caterpillar Trees
In this subsection we focus on the set of caterpillar trees and establish some properties of shortest paths between those trees in both and . In Theorem 3 we will see that, in both and , any two caterpillar trees are connected by a shortest path consisting only of caterpillar trees. We say that a set of trees is convex in a tree space, if there is a shortest path between any two trees in this set that stays within the set. The set of caterpillar trees is hence convex in . The space of unranked trees however does not have this property Gavryushkin2018-ol. Based on the convexity of the set of caterpillar trees in we introduce a way to compute distances between caterpillar trees in this space in time in Corollary 2, and hence with better worst-case time complexity than FindPath. Whether this complexity can be achieved in is an open question.
Theorem 3. The set of caterpillar trees is convex in .
Let and be two caterpillar trees in . We prove the theorem by showing that there is a caterpillar tree that is a neighbour of and closer to than . The existence of a shortest path consisting only of caterpillar trees between and follows inductively. Throughout this proof we consider the extended ranked versions and of and .
Let be the leaf with parent with maximum rank in among those whose parents do not have equal rank in and . Let furthermore be a leaf with . We define to be the caterpillar tree resulting from by an NNI move or rank move exchanging the ranks of and . An NNI move is necessary if these two nodes are connected by an edge; otherwise a rank move corresponding to a length move is performed on to obtain (Figure 5). In both cases is a caterpillar tree. We will use properties of shortest paths computed by FindPath to show that .
Since all clusters of and induced by nodes of rank less than coincide, the paths and coincide up to a ranked tree , which contains all these clusters. We now compare the lengths of and . We note at first that . If it otherwise was , it would follow , by the definition of , and therefore . however contradicts the definition of , hence . It follows , as and are not in any of the clusters considered by FindPath before , which means that their parents do not exchange ranks before .
By our assumptions on , the cluster considered on in iteration , which is the iteration following , is , where is a cluster that is present in all three trees and . In the following iteration , is considered for a cluster , where either equals , if and are connected by an NNI move (bottom of Figure 5), or is a cluster present in , , and , if and are connected by a rank move (top of Figure 5). Decreasing the rank of takes moves. Because the rank of increases by one when the parents of and swap ranks in this iteration, the following iteration for needs moves. On however, first moves decrease the rank of , and then are needed for . In total, these two iterations combined result in at least one extra move on compared to .
The only difference in the trees after iteration on the two different paths is the order of ranks of the parents of and . Since the rest of and coincide, the remaining parts of and consist of the same moves. With our previous observation we can conclude , and hence is on a shortest path from to . ∎
Note that it follows that the set of caterpillar trees is convex in . This convexity property implies that the distance between caterpillar trees can be computed more efficiently than by FindPath. We prove this in the rest of this section. To do so, we first establish that the problem of computing a shortest path consisting only of caterpillar trees can be interpreted in a few different ways.
One problem analogous to the shortest path problem for caterpillar trees in is the Token Swapping Problem Kawahara2017-ey on a special class of graphs, so-called lollipop graphs. An instance of the token swapping problem is a simple graph where every vertex is assigned a token. Two tokens are allowed to swap positions if they are on vertices that are connected by an edge. Each token is assigned a unique goal vertex, and the aim is to find the minimum number of token swaps for all tokens to reach their goal vertex.
The problem of computing distances between caterpillar trees can be seen as an instance of the token swapping problem on lollipop graphs. A lollipop graph is a graph consisting of a complete graph that is connected to a path by one edge. An instance of the token swapping problem that corresponds to the distance problem for caterpillar trees is described in the following. An example is illustrated in Figure 6. Let and be caterpillar trees with
such that is a permutation of . The corresponding instance of the token swapping problem consists of a lollipop graph consisting of a complete graph on three vertices, connected to a path of length by an edge. The vertex in the complete graph incident to the edge connecting complete graph and path is labelled by , the other ones in the complete graph are labelled by and . The vertices on the path are then labelled inductively, starting at the neighbour of , such that the neighbour of the last already labelled node with label is labelled by . The token on vertex has as goal vertex. Since the only moves between two caterpillar trees in RNNI are NNI moves, which simply swap two leaves, they correspond to swapping two tokens in the above described instance of the token swapping problem.
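For very small instances, the token swapping problem on a lollipop graph can be solved by brute force, which is useful for checking intuition about the correspondence described above. The sketch below is a plain breadth-first search over token configurations, not the algorithm of Kawahara2017-ey, and its running time grows factorially with the number of vertices. The vertex numbering (a triangle on vertices 0, 1, 2 with the path attached at vertex 2) and the function names are hypothetical choices made for this illustration.

```python
from collections import deque


def lollipop_edges(path_vertices: int) -> list[tuple[int, int]]:
    """Edges of a triangle on vertices 0, 1, 2 with a path of `path_vertices`
    further vertices attached at vertex 2 (hypothetical numbering)."""
    edges = [(0, 1), (0, 2), (1, 2)]
    edges += [(v, v + 1) for v in range(2, 2 + path_vertices)]
    return edges


def min_token_swaps(edges, start, goal) -> int:
    """Minimum number of swaps along edges that turns `start` into `goal`.

    A configuration is a tuple whose entry at index v is the token on vertex v.
    """
    start, goal = tuple(start), tuple(goal)
    distance = {start: 0}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            return distance[state]
        for u, v in edges:
            swapped = list(state)
            swapped[u], swapped[v] = swapped[v], swapped[u]
            nxt = tuple(swapped)
            if nxt not in distance:
                distance[nxt] = distance[state] + 1
                queue.append(nxt)
    raise ValueError("goal configuration is not reachable")


if __name__ == "__main__":
    edges = lollipop_edges(2)  # triangle plus two path vertices
    print(min_token_swaps(edges, (4, 3, 2, 1, 0), (0, 1, 2, 3, 4)))
```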
Therefore, the algorithm described by Kawahara2017-ey to solve the token swapping problem on lollipop graphs can be used for computing distances between caterpillar trees. It however has worst-case time complexity , the same as FindPath.
In the following we present an algorithm for computing distances between caterpillar trees with better worst-case time complexity, , for (Corollary 2). To do so, we first establish a formula to express distances between two caterpillar trees in (Theorem 4). This algorithm can also be used to solve the token swapping problem on lollipop graphs, improving the worst-case running time of the known algorithm Kawahara2017-ey.
Theorem 4. Let and be caterpillar trees in such that
Then for and :
We refer to pairs , as defined in Theorem 4, as transpositions. The reason for this is that caterpillar trees can be seen as permutations of the set , ordered by the ranks of their parents. The tree in the theorem then corresponds to the identity permutation . Note that there is no one-to-one correspondence between permutations and caterpillar trees. For example the permutation corresponds to the tree as well. Therefore, the two pairs of leaves sharing their parent in and , respectively, are not in the set .
Let and be caterpillar trees as described above and . For proving , it is sufficient to show that for all caterpillar trees that are neighbours of we have
The fact that inequality (1) implies can be established by induction as in [Theorem 1]Collienne2020-iu.
For proving inequality (1) we first establish and then , assuming that is a caterpillar tree that is an neighbour of . We then show that and cannot both be true simultaneously, which proves inequality (1).
At most one transposition of can be resolved in because the only move possible between caterpillar trees and is an NNI move exchanging two leaves. Hence . Let and be the leaves that exchange their position between and , such that . Since these are the only leaves that change positions between and , they are the only elements that could be in . It remains to show , from which we can conclude that . We prove this by showing that if , it follows .
We assume that , implying , so at least one of the following conditions must be violated for :
At first we consider the case that (C1) is violated for in . Then there is an such that and . It immediately follows that the same condition is violated for in , because the move exchanging and preserves the relationship of and . It hence is , contradicting our assumption .
We can therefore assume that (C2) is violated for . It follows . As only and exchange between and and , it follows . This however results in a violation of (C2) for in and hence . We can conclude , and hence .
It remains to show that and cannot be true at the same time. We assume , hence is a transposition in , meaning that and . As and are the only leaves that could be in , it suffices to show that neither of them actually is in , resulting in .
That is not in follows from the violation of (C1) by and . It hence is . Moreover, if , it follows as explained in the following. If , both conditions (C1) and (C2) are met by in . With and it immediately follows that these conditions are also met in , and hence , and therefore .
Summarising, it is , and we can conclude that if , it is , which concludes this proof. ∎
Corollary 2. The distance between two caterpillar trees can be computed in in .
By Theorem 4 the distance between two caterpillar trees in is the number of transpositions between two sequences of length minus as defined in Theorem 4. The value can be computed in time linear in for any caterpillar tree by considering the leaves of the tree ordered according to increasing rank of their parents. The number of transpositions of a sequence of length (Kendall-tau distance) can be computed in time Chan2010-ls. This number is equal to , as defined in Theorem 4, when ignoring transpositions for the pairs of leaves sharing a parent in and , respectively. The worst-case running time for computing the distance between caterpillar trees is therefore . ∎
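The core subroutine in this proof is counting the transpositions (inversions) of a sequence. The sketch below uses the textbook merge-sort approach, which runs in O(n log n) time; it is a simpler stand-in for the faster algorithm of Chan2010-ls cited above and does not reach that bound. It also counts all inversions, so the correction term and the exclusion of leaf pairs sharing a parent described in Theorem 4 are not applied here. The function name is hypothetical.

```python
def count_inversions(sequence) -> int:
    """Number of pairs i < j with sequence[i] > sequence[j] (Kendall-tau distance
    to the sorted order), computed with merge sort in O(n log n) time."""

    def sort_and_count(items):
        if len(items) <= 1:
            return items, 0
        middle = len(items) // 2
        left, left_count = sort_and_count(items[:middle])
        right, right_count = sort_and_count(items[middle:])
        merged = []
        inversions = left_count + right_count
        i = j = 0
        while i < len(left) and j < len(right):
            if left[i] <= right[j]:
                merged.append(left[i])
                i += 1
            else:
                # right[j] is smaller than every remaining element of left.
                merged.append(right[j])
                j += 1
                inversions += len(left) - i
        merged.extend(left[i:])
        merged.extend(right[j:])
        return merged, inversions

    return sort_and_count(list(sequence))[1]


if __name__ == "__main__":
    # Leaves of a caterpillar tree listed by increasing rank of their parents,
    # written as a permutation relative to a reference tree.
    print(count_inversions([2, 3, 1, 5, 4]))
```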
4.3. Diameter and Radius
In this section we investigate the diameter of and , which is the greatest distance between any pair of trees in each of these graphs, respectively, i.e. . We first establish the exact diameter of , improving the upper bound given by Gavryushkin2018-ol. Afterwards, we generalise this result to .
Theorem 5. The diameter of is .
For proving this theorem we use the fact that FindPath computes shortest paths in . Each iteration of FindPath, applied to two ranked trees and , decreases the rank of the most recent common ancestor of a cluster , induced by the node of rank in , in the currently last tree on the already computed path (starting with ). The maximum rank of at the beginning of iteration is , the rank of the root. As every move decreases the rank of by one, there are at most moves in iteration . The maximum length of a shortest path in is hence . Note that the caterpillar trees and provide an example of trees that have distance , as already pointed out in [Corollary 1]Collienne2020-iu, proving that this upper bound for the length of a shortest path is tight. ∎
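The counting step in this proof can be reproduced numerically. The sketch assumes, following the argument above, that iteration i of FindPath lowers the rank of a node from at most n - 1 (the rank of the root) down to i, and therefore needs at most n - 1 - i moves; the per-iteration bound and the resulting closed form (n-1)(n-2)/2 are not legible in the text and are stated here as assumptions derived from that reasoning. Function names are hypothetical.

```python
def max_findpath_moves(n: int) -> int:
    """Upper bound on the number of moves FindPath performs between ranked trees
    on n leaves, summing the assumed per-iteration bound n - 1 - i for i = 1..n-2."""
    return sum((n - 1) - i for i in range(1, n - 1))


def diameter_closed_form(n: int) -> int:
    """Assumed closed form of the same sum: (n - 1)(n - 2) / 2."""
    return (n - 1) * (n - 2) // 2


if __name__ == "__main__":
    for n in range(3, 9):
        print(n, max_findpath_moves(n), diameter_closed_form(n))
```

The sum is an arithmetic series 1 + 2 + ... + (n - 2), which is why the bound grows quadratically in the number of leaves.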
Theorem 6. The diameter of is .
In order to prove the diameter of , we consider the longest path that FindPath can compute between any two trees and . That is, we consider the maximum number of moves that FindPath can perform on the extended ranked versions and of any two trees and . Therefore, we distinguish moves in the subtrees on the leaf set from the rank moves corresponding to length moves, i.e. rank moves between one node of each of the subtrees on leaf subsets and .
The maximum number of moves (excluding rank moves corresponding to length moves) on follows from Theorem 5 and is