A phylogenetic tree is a tree describing the evolution of a set of entities (species, genes etc.), which will be called taxa from now onwards. Degree-one nodes are called leaves and a bijective function associates each taxon to a leaf. Internal nodes represent putative ancestral taxa and branch lengths quantify the evolutionary distances between nodes.
Tree comparison metrics provide a quantitative measure of the similarity or difference between two phylogenetic trees. They have proven invaluable for statistical testing (e.g. [11, 9, 13]), for visualisation , and for the construction of consensus trees [14, 3, 10]. By far the most well-known tree comparison metric is the Robinson-Foulds metric , which equals the number of bipartitions111A bipartition with is in a phylogenetic tree if there exists an edge such that its removal creates two trees with taxon sets and .that are in one tree and not the other. However many other different metrics have also been proposed, each one based on a different characteristic of the trees being compared.
Here we consider pairs of trees on the same set of taxa. Also, our trees are binary, i.e. each internal node has degree three. The path-length between two taxa in a phylogenetic tree is the sum of the branch lengths along the unique path between them. The path-length distance between two trees and is given by
where is the path length between taxa and in the first tree and is the path length in the second tree. We note that is a metric in the mathematical sense. The first explicit description of the metric appears in  (without branch lengths) and  (with branch lengths), though closely related ideas appear much earlier (e.g. [7, 6, 15]).
Given a phylogeny with leaves, it takes time to construct the set of all path-lengths , using the dynamic programming algorithm presented in . Hence the path-length distance can be easily computed in time. Our main contribution in this paper is to show that we can compute this distance in time, which is almost, but not quite, linear in the size of the problem input.
Expanding (1) gives
The first two terms can be evaluated in linear time using dynamic programming, as outlined in Section 2. To compute the second term efficiently we first introduce a tree decomposition technique (Section 3) allowing the sum to be evaluated in time (Section 4). Both the tree decomposition and algorithm of Section 4 draw heavily on an algorithm of  for computing the quartet distance between two trees.
2 Sums of squared distances
In this section we show how to compute the sum of squared distances in a tree in linear time. We begin by introducing some notation that will be used in the rest of the paper.
Select an arbitrary leaf and consider both and as rooted trees with root . We think of being at the top of the tree and the other leaves being at the bottom of the tree. For any two edges we write if the path from to passes through . We write if and . Hence if is the edge incident with the root then for all other edges . We say that is external if it is incident to a leaf other than ; otherwise is internal. When is internal let and denote the edges incident and immediately below .
We will use to denote edges in and to denote edges in . We let denote the length of an edge in and the length of an edge in . Let denote the set of edges on the path from to in and let denote the corresponding set in . Hence
Let denote the number of leaves such that the path from to passes through . Define
Given two edges we let
the number of pairs having both and on the path between them. Then
If then . Hence, for any we have
If and then . Furthermore there is an edge with children such that, without loss of generality, and . For such an edge we have
Summing up over all in (6) we have
and the result follows. ∎
The sum can be computed in linear time.
If is external, and . Otherwise
Hence with a post-order traversal of the tree we can compute and for all edges in time. Computing the sum takes a further time by Proposition 1. can be computed in the same way. ∎
3 Segment decomposition
In this section we introduce a hierarchical decomposition of the edge set of that forms the structure used in our dynamical programming algorithm in Section 4.
Let be a connected subset of , the set of edges of . We define the boundary of to be the set of vertices incident both to edges within and to edges outside :
The degree of is the cardinality of . A segment of is a connected subset of with degree at most two.
A segment decomposition for is a binary tree such that
The leaves of correspond to edges in (i.e. minimal segments);
Each node of corresponds to a segment of ;
The segment corresponding to an internal node of equals the disjoint union of the segments corresponding to its children.
An example of segment decomposition is given in Figure 1.
The main result in this section is that we can, in linear time, construct a segment decomposition for with height .
The definition of a segment decomposition is based on the tree decomposition used by  to compute quartet-based distances, which in turn are based on techniques for efficient parsing of expressions [1, 5]. The main difference with  is that the segment decomposition is based on partitioning the set of edges, rather that the set of vertices, and that we were able to obtain a tighter bound on the height.
Our algorithm for constructing is agglomerative: we start with a degree one vertex for each edge in ; these form the leaves of . Each iteration, we identify pairs of maximal nodes corresponding to pairs of segments which can be combined to give new segments. We make the nodes in each pair children of a new node. The process continues until one node remains and is complete.
The following Proposition shows that in any partition of into segments we can always find a large number of pairs of disjoint segments which can be merged to give other segments.
Let be a binary tree. Let be a collection of segments which partition . Then there are at least non-overlapping pairs such that and is a segment of .
Let be the graph with vertex set
and edge set
Decompose into maximal paths which contain no degree three vertices in their interiors. For each , let be the set of elements such that . The sets partition .
Fix one path . We order the elements of lexicographically with respect to the indices of their boundary vertices. In other words, if satisfy and (where we might have or ) then we write if or and . With this ordering, if and are adjacent then is connected and has degree at most two. Hence by pairing off and , and , and so on, we can construct disjoint pairs. An example is given in Figure 2.
The total number of pairs we obtain this way is given by . We will determine a lower bound for this sum. Let be the number of degree three vertices in . Since is connected and acyclic there are paths which contain a degree one vertex in and paths which do not. If contains a degree one vertex then contains at least one component with degree two and another component with boundary equal to the degree one vertex, so . If contains no degree one vertices then is at least one. Let denote the number of paths which contain a degree one vertex and for which
is odd (and hence at least three). We have
as well as and .
We have that is even for at least paths ending in a degree one vertex, and for these paths . Thus
To bound the right hand side, note that the linear program
has solution , and so . Hence
and , the number of pairs, is bounded below by . ∎
We can now state the algorithm for constructing . Initially is a set of isolated vertices. As the algorithm progresses, vertices are combined into larger trees, so that each iteration is a forest. The algorithm terminates when contains a single tree.
At each iteration let denote the partition of the edge set of into segments corresponding to the maximal elements of the incomplete tree . Rather than store this partition explicitly, we maintain a linked list of boundary nodes. For each element in the list we maintain pointers to maximal nodes corresponding to segments in having in their boundaries. In addition, we maintain pointers from each node in to the boundary nodes of the corresponding segments.
Initialize with a forest of degree-one vertices corresponding to each edge of . Hence we initialise with one element for each vertex in , with the associated pointers. At this point, is the partition of putting each edge into a separate block.
While is disconnected do
Using the construction in Proposition 3 determine a set of at least pairs of disjoint elements of such that has at most two boundary points.
For each pair , , create a new node of corresponding to and with children corresponding to and .
Update the list of boundary vertices and the associated pointers.
We can construct a segment decomposition tree for with height in time.
We only merge nodes if the union of their corresponding segments is also a segment. Hence will be a segment decomposition tree. It remains to prove the bound on height and running time.
We note that reduces by a factor of each iteration. Hence the number of iterations is at most , which is also a bound on the height of the tree.
Using the list of boundary points we can construct construct and identify pairs, in time each iteration. Thus the total running time is at most time. ∎
We can strengthen the height bound. We say that a tree is -locally balanced if, for all nodes in the tree, the height of the subtree rooted at is at most . As the algorithm can be applied recursively on each node of we have that the global height bound applies to each node. Hence
The segment decomposition is ()- locally balanced.
4 Computing the inner product
In this section we show that can be computed in time, so that the main result follows from Eq. (2).
A (taxon) colouring is an assignment of the colors black and white to the taxa. For each edge of we let denote a coloring assigning black to those taxa on one side of and white to those on the other. For each edge in and each colouring of the set of taxa, we let denote the number of pairs of taxa such that and have different colours and they label leaves on different sides of .
For the remainder of this section we will assume that the vertices in are indexed . The actual ordering does not matter; it is only used to help presentation.
Let be the segment decomposition tree constructed for using the Algorithm in Section 3. For each node of we let denote corresponding segment in . The overall strategy at this point is to compute values for each node in which will allow us to: (i) compute, for an initial choice of , the sum in linear time, and (ii) update this computation efficiently as we iterate in a particular way through edges of .
We will store three pieces of information at every non-root node of , the exact type of information stored being dependent on the degree of the segment corresponding to .
If is degree one then we store:
Two integer counts
A description (e.g. coefficients) for a quadratic polynomial with two variables.
If has degree two then we store:
Two integer counts
A description (e.g. coefficients) for a quadratic polynomial with four variables.
We now show how the values and are computed using a colouring of the taxa. We start at the leaves of and work upwards towards the root.
Suppose that is a leaf of , so that contains a single edge of . There are two cases.
The edge is incident with a leaf of , so has degree one. If is black then and , while if is white we have and . In either case
The edge is not incident with a leaf of , so has degree two. Then and
Now suppose that is an internal vertex of . Once again there are several cases, however in all cases we have
Suppose and have degree one. Then
Suppose has degree two and has degree one, where and .
(a) If has degree one and then (14) (b) If has degree one and then (15) (c) If has degree two and then (16) (d) If has degree two and then (17)
The case when has degree one and has degree two is symmetric.
Suppose that and have degree two, that and . We can assume that since the alternative case follows by symmetry. This leaves three possibilities:
(a) If and then (19) (b) If and then (20) (c) If and (hence) then (21)
An illustration for several of these cases can be found in Figure 3 below.
Suppose that and have been computed as above for all nodes of except the root. Let and be the children of the root of . Then
For any node of we let denote the set of leaves of not incident with an edge of . If has degree two and boundary , , then we let be the leaves in which are closest to and the leaves in which are closest to . Let be any colouring of the leaves of , possibly distinct from . Let and be the sets of leaves that colours black and white respectively.
We will establish the following claims for all nodes in , using induction on the height of the node.
and are the number of leaves incident with edges in which are coloured black and white by (and hence by ).
If has degree one, and then
If has degree two, , , , and then
We start by considering any leaf of . In this case, contains a single edge . If is an edge incident to a leaf coloured white then , as required, and equals the number of leaves coloured black by , so
The same holds if the leaf is coloured black.
If the edge is internal then , and is equal to the number of paths crossing connecting leaves with different colours, or
Now consider the case when is an internal node of , other than the root. Let and be the two children of . Note that is the disjoint union of and , so and , proving (C1).
Furthermore, we have
If has degree one then, by the induction hypothesis,
where and are the numbers of leaves coloured black and white that are not incident with edges in . Similarly, if has degree two then, by the induction hypothesis,
where and are the numbers of leaves coloured black and white that are not incident with edges in and are closer to the boundary vertex of with the smallest index, while
and are the numbers of leaves coloured black and white that are not incident with edges in and are closer to the boundary vertex of with the largest index. The symmetric result holds for .
The different cases in Eq. (12) to Eq. (21) now correspond to the different counts for or for depending on whether and have degree one or two, and whether the boundary vertices in common had the highest or lower index for each segment.
Now suppose that and are the children of the root of . Then so and must both have degree one. We have that is the disjoint union of and . Any leaf not incident to a leaf in is incident to a leaf in and vice versa. Hence as required. ∎
Evaluating Eq. (12) to Eq. (21) takes constant time and space per each node of , since we manipulate and store a constant number of polynomials with at most four variables and total degree at most two. Thus, evaluating Eq. (12) to Eq. (21) takes time and space for each colouring, and since we want to sum this quantity over all colourings from edges a naive implementation would still take time. The key to improving this bound is in the use of efficient updates.
Suppose that we have computed , and the functions for all , using a leaf colouring . Let be a colouring which differs from at leaves. Then we can update the values , and the functions in time.
Let be the set of edges of which are incident to a leaf for which and have a different colour, so . The only nodes in which need to updated are those with for some . This is a union of the paths from leaves of to the root of , and so by Lemma 2 of  , it has size . ∎
The final step is to show that we can navigate the edges in so that the total number of changes in the colourings is bounded appropriately. Suppose that is rooted at the leaf (the same as ). For each internal node in we let Small() denote the child of with the smallest number of leaf descendants and let Large() denote the child with the largest number of leaf descendants, breaking ties arbitrarily.
The following recursive procedure returns the sum of
over all edges . Initially we let be the edge incident with the root . Let be the colouring where is black and all other leaves white. We initialise and fill out the values , and for all nodes of using the colouring . We then call Sum() where is the unique internal node adjacent to .