 # Optimal Algorithm to Reconstruct a Tree from a Subtree Distance

This paper addresses the problem of finding a representation of a subtree distance, which is an extension of the tree metric. We show that a minimal representation is uniquely determined by a given subtree distance, and give a linear time algorithm that finds such a representation. This algorithm achieves the optimal time complexity.


## 1 Introduction and the results

A phylogenetic tree represents an evolutionary relationship among the species that are investigated. Estimating a phylogenetic tree from experimental data is a fundamental problem in phylogenetics wiley2011phylogenetics. One of the commonly used approaches to this task is the distance-based method. In this approach, we first compute the dissimilarity (i.e., a nonnegative and symmetric function) between the species by, e.g., the edit distance between the genome sequences. Then, we find a weighted tree whose shortest path distance best fits the given dissimilarity. The most popular method of this kind is the neighbor-joining method saitou1987neighbor.

A weighted tree $T = (V, E; l)$ is specified by the set $V$ of vertices, the set $E$ of edges, and the nonnegative edge weight $l : E \to \mathbb{R}_+$. Let us consider the case in which a given dissimilarity $d : X \times X \to \mathbb{R}_+$ exactly fits some weighted tree; i.e., there exist a weighted tree $T = (V, E; l)$ and a mapping $\psi : X \to V$ such that

$$
d(x,y) = d_T(\psi(x), \psi(y)) \qquad (x, y \in X), \tag{1.1}
$$

where $d_T(u,v)$ is the distance between $u$ and $v$ in $T$ for $u, v \in V$. In this case, the dissimilarity $d$ is called a tree metric, and the pair $(T, \psi)$ is called a representation of $d$. It is known that a dissimilarity is a tree metric if and only if it satisfies an inequality called the four-point condition zaretskii1965constructing; buneman1971recovery, which is given by

$$
d(x,y) + d(z,w) \le \max\{\, d(x,z) + d(y,w),\ d(x,w) + d(y,z) \,\} \tag{1.2}
$$

for any $x, y, z, w \in X$. For any tree metric $d$, there exists a unique minimal representation of $d$ buneman1971recovery. Here, a representation $(T, \psi)$ is minimal if there is no representation $(T', \psi')$ of $d$ such that $T'$ is obtained by removing some vertices and edges and/or by contracting some edges of $T$ (i.e., $T'$ is a proper minor of $T$). Furthermore, such a representation can be constructed in $O(n^2)$ time culberson1989fast, where $n = |X|$.
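As an illustration of the four-point condition (1.2), the following minimal sketch checks it by brute force over all quadruples, including repeated elements. The function name `satisfies_four_point`, the list-of-lists matrix layout, and the tolerance `tol` are assumptions made for this example; the check takes $O(n^4)$ time and is not the reconstruction algorithm discussed in the text.

```python
from itertools import product

def satisfies_four_point(d, tol=1e-9):
    """Brute-force check of the four-point condition (1.2).

    d is a symmetric n x n matrix (list of lists) with zero diagonal.
    All quadruples are tested, repeated elements included.
    """
    n = len(d)
    for x, y, z, w in product(range(n), repeat=4):
        lhs = d[x][y] + d[z][w]
        rhs = max(d[x][z] + d[y][w], d[x][w] + d[y][z])
        if lhs > rhs + tol:
            return False
    return True

# Hypothetical data: a star tree whose leaves hang at distances 1, 2, 3, 4
# from the center, and the shortest path metric of a 4-cycle.
weights = [1, 2, 3, 4]
star = [[0 if i == j else weights[i] + weights[j] for j in range(4)]
        for i in range(4)]
cycle = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]

print(satisfies_four_point(star))   # True: a tree metric
print(satisfies_four_point(cycle))  # False: violated by the two diagonals
```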

In some applications, we are interested in the distance between groups of species (e.g., genus, tribe, or family). In such a case, we aim to identify a group as a connected subgraph in a phylogenetic tree. The subtree distance, which was introduced by Hirai hirai2006characterization, is an extension of the tree metric that can be adopted in such situations. A function $d : X \times X \to \mathbb{R}_+$ is called a subtree distance if there exist a weighted tree $T = (V, E; l)$ and a mapping $\phi : X \to 2^V$ such that $\phi(x)$ induces a subtree of $T$ (i.e., a connected subgraph of $T$) for each $x \in X$ and the equations

$$
d(x,y) = d_T(\phi(x), \phi(y)) \qquad (x, y \in X) \tag{1.3}
$$

hold, where $d_T(A, B) = \min\{ d_T(u, v) : u \in A,\ v \in B \}$ for $A, B \subseteq V$. We say that the pair $(T, \phi)$ is a representation of $d$. Note that a subtree distance is not necessarily a metric because it may violate non-degeneracy ($d(x,y) > 0$ for $x \neq y$) and the triangle inequality ($d(x,z) \le d(x,y) + d(y,z)$). Hirai proposed a characterization of subtree distances in which a dissimilarity is a subtree distance if and only if it satisfies an inequality called the extended four-point condition, which is given by

$$
d(x,y) + d(z,w) \le \max\left\{
\begin{array}{l}
d(x,z) + d(y,w),\quad d(x,w) + d(y,z),\quad d(x,y),\quad d(z,w), \\[4pt]
\dfrac{d(x,y) + d(y,z) + d(z,x)}{2},\quad \dfrac{d(x,y) + d(y,w) + d(w,x)}{2}, \\[8pt]
\dfrac{d(x,z) + d(z,w) + d(w,x)}{2},\quad \dfrac{d(y,z) + d(z,w) + d(w,y)}{2}
\end{array}
\right\} \tag{1.4}
$$

for any $x, y, z, w \in X$. Checking the extended four-point condition for all quadruples yields an $O(n^4)$ time algorithm to recognize a subtree distance.
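The $O(n^4)$ recognition procedure implied by (1.4) can be sketched as a direct loop over quadruples. The function name `satisfies_extended_four_point`, the matrix layout, and the tolerance are assumptions for this example; the two sample matrices are hypothetical inputs chosen by hand.

```python
from itertools import product

def satisfies_extended_four_point(d, tol=1e-9):
    """Brute-force check of the extended four-point condition (1.4)."""
    n = len(d)
    for x, y, z, w in product(range(n), repeat=4):
        lhs = d[x][y] + d[z][w]
        rhs = max(
            d[x][z] + d[y][w],
            d[x][w] + d[y][z],
            d[x][y],
            d[z][w],
            (d[x][y] + d[y][z] + d[z][x]) / 2,
            (d[x][y] + d[y][w] + d[w][x]) / 2,
            (d[x][z] + d[z][w] + d[w][x]) / 2,
            (d[y][z] + d[z][w] + d[w][y]) / 2,
        )
        if lhs > rhs + tol:
            return False
    return True

# Path a - b - c with unit edges; objects 0, 1, 2 are mapped to
# {a}, {c}, {a, b, c}. The triangle inequality fails (2 > 0 + 0),
# yet the function is a subtree distance.
path_objects = [[0, 2, 0], [2, 0, 0], [0, 0, 0]]
# Shortest path metric of a 4-cycle, which no tree can represent.
cycle = [[0, 1, 2, 1], [1, 0, 1, 2], [2, 1, 0, 1], [1, 2, 1, 0]]

print(satisfies_extended_four_point(path_objects))  # True
print(satisfies_extended_four_point(cycle))         # False
```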

This paper addresses the following problem, which we call the subtree distance reconstruction problem.

###### Problem 1.

Given a subtree distance $d$ on a finite set $X$, find a representation $(T, \phi)$ of $d$.

Ando and Sato ando2017algorithm proposed an $O(n^3)$ time algorithm for this problem. Their algorithm consists of three steps: (1) identify a subset $L \subseteq X$, (2) find a representation for the restriction of $d$ onto $L$, and (3) for each $x \in X \setminus L$, locate $\phi(x)$ in the tree by examining connected components.

In this study, we prove the following theorems. We define a minimal representation for a subtree distance in the same manner as a minimal representation for a tree metric.

###### Theorem 2.

For a subtree distance $d$, a minimal representation is uniquely determined by $d$.

###### Theorem 3.

There exists an $O(n^2)$ time algorithm that finds, for any subtree distance $d$, its unique minimal representation, where $n = |X|$.

The proof of Theorem 3 is constructive. Similar to ando2017algorithm, our algorithm consists of three parts: (1) identify the set $L \subseteq X$ of objects that are mapped to the leaves, (2) find the minimal representation $(T, \phi)$ for the restriction of $d$ onto $L$, and (3) for each $x \in X \setminus L$, locate $\phi(x)$ in $T$ by measuring the distances from the leaves. Since Steps 1 and 3 can be implemented in $O(n^2)$ time, and there is an $O(n^2)$ time algorithm for Step 2 culberson1989fast, the total time complexity of the algorithm is $O(n^2)$. Note that even if we know that $d$ is a tree metric, $\Omega(n^2)$ time is required to reconstruct a tree hein1989optimal. Therefore, our algorithm achieves the optimal time complexity.

This algorithm can also be used to recognize a subtree distance by checking the failure or inconsistency during the process and by verifying equations (1.3) after the reconstruction.
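The final verification of equations (1.3) can be sketched as follows. This is a brute-force illustration, not the $O(n^2)$ procedure the text refers to: it computes all tree distances by a search from every vertex and takes the minimum over pairs of vertices. The adjacency-dictionary layout and the names `tree_distances` and `is_representation` are assumptions for this example, and the sketch does not check that each $\phi(x)$ is connected.

```python
def tree_distances(adj):
    """All-pairs path lengths in a weighted tree given as {u: {v: length}}."""
    dist = {}
    for s in adj:
        dist[s] = {s: 0}
        stack = [s]
        while stack:
            u = stack.pop()
            for v, length in adj[u].items():
                if v not in dist[s]:
                    dist[s][v] = dist[s][u] + length
                    stack.append(v)
    return dist

def is_representation(d, adj, phi, tol=1e-9):
    """Check d(x, y) == d_T(phi(x), phi(y)) for all x, y, as in (1.3).

    phi[x] is the set of tree vertices assigned to object x.
    """
    dist = tree_distances(adj)
    n = len(d)
    for x in range(n):
        for y in range(n):
            d_tree = min(dist[u][v] for u in phi[x] for v in phi[y])
            if abs(d_tree - d[x][y]) > tol:
                return False
    return True

# Path a - b - c with unit edges; objects mapped to {a}, {c}, {a, b, c}.
adj = {"a": {"b": 1}, "b": {"a": 1, "c": 1}, "c": {"b": 1}}
phi = [{"a"}, {"c"}, {"a", "b", "c"}]
d = [[0, 2, 0], [2, 0, 0], [0, 0, 0]]
print(is_representation(d, adj, phi))  # True
```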

###### Corollary 4.

There exists an $O(n^2)$ time algorithm that determines whether a given input $d$ is a subtree distance or not, where $n = |X|$.

## 2 Proofs

We assume that there are no distinct objects $x, y \in X$ such that $d(x,z) = d(y,z)$ for all $z \in X$. This assumption is satisfied by removing such elements after lexicographic sorting of the rows of $d$, which requires $O(n^2)$ time wiedermann1979complexity. Clearly, this preprocessing does not change the minimal representation. We also assume that $X$ has enough objects for the problem to be nontrivial; otherwise, the theorems trivially hold.
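The preprocessing step can be sketched as below. After the rows are sorted lexicographically, objects with identical rows become adjacent, so one pass removes the duplicates. Note that Python's built-in comparison sort does not achieve the bound of the lexicographic sorting algorithm cited above; the function name `remove_duplicate_objects` and the return format are assumptions for this example.

```python
def remove_duplicate_objects(d):
    """Keep one object per group of objects with identical rows in d.

    Returns (keep, reduced): the kept indices in increasing order and
    the matrix d restricted to those indices.
    """
    n = len(d)
    # Stable sort: among identical rows, the smallest index comes first.
    order = sorted(range(n), key=lambda i: tuple(d[i]))
    keep = []
    for i in order:
        if not keep or tuple(d[keep[-1]]) != tuple(d[i]):
            keep.append(i)
    keep.sort()
    reduced = [[d[i][j] for j in keep] for i in keep]
    return keep, reduced

# Objects 1 and 3 have identical rows, so object 3 is dropped.
d = [[0, 2, 0, 2],
     [2, 0, 0, 0],
     [0, 0, 0, 0],
     [2, 0, 0, 0]]
keep, reduced = remove_duplicate_objects(d)
print(keep)     # [0, 1, 2]
print(reduced)  # [[0, 2, 0], [2, 0, 0], [0, 0, 0]]
```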

First, we prove Theorem 2. We identify the properties of a minimal representation.

###### Lemma 5.

Let $(T, \phi)$ be a minimal representation of a subtree distance $d$. Then, the following properties hold.

1. For each edge $e \in E$, the length $l(e)$ of $e$ is positive.

2. For each leaf vertex $v \in V$, there exists $x \in X$ such that $\phi(x) = \{v\}$.

###### Proof.

1. If there is an edge $e$ of zero length, we can contract $e$ to obtain a smaller representation.

2. Let $v$ be a leaf vertex of $T$. If there is no $x \in X$ with $v \in \phi(x)$, we can remove $v$ from the representation to obtain a smaller representation. Suppose that, for all $x \in X$ with $v \in \phi(x)$, $\phi(x)$ contains at least two elements. Then, each of these $\phi(x)$ contains the unique adjacent vertex $u$ of $v$. Since $d_T(u, w) \le d_T(v, w)$ for all $w \in V \setminus \{v\}$, $v$ does not contribute to any shortest path in the tree. Therefore, we can remove $v$ from the representation. ∎

Motivated by Property 2 in Lemma 5, we introduce the following definition. For a minimal representation $(T, \phi)$, an object $x \in X$ is a leaf object if $\phi(x) = \{v\}$ for some leaf $v \in V$.

We defined a leaf object by specifying a minimal representation. However, as shown below, the set of leaf objects is uniquely determined by $d$. First, we show that there exists an object that is a leaf object for any minimal representation.

###### Lemma 6.

Let $r, s \in X$ be a pair attaining $\max_{x, y \in X} d(x, y)$. Then, for any minimal representation, $r$ and $s$ are leaf objects.

###### Proof.

For any tree, the farthest pair is attained by a pair of leaves. ∎

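Next, we show that the leaf objects are characterized by $d$ and a leaf object $r$.

###### Lemma 7.

Let $(T, \phi)$ be a minimal representation of a subtree distance $d$, and let $r$ be a leaf object. An object $x \in X \setminus \{r\}$ is a leaf object if and only if $d(r, z) < d(r, x) + d(x, z)$ for all $z \in X \setminus \{x\}$.

###### Proof.

(The “if” part). We prove the contrapositive. Suppose $x$ is not a leaf object. Then, there is another leaf object $z \neq r$ such that the path from $\phi(r)$ to $\phi(z)$ intersects $\phi(x)$. By considering the distances among $\phi(r)$, $\phi(x)$, and $\phi(z)$ on this path, we have $d(r, z) \ge d(r, x) + d(x, z)$.

(The “only if” part). Suppose $x$ is a leaf object. Then, for any $z \in X \setminus \{x\}$, the shortest path from $\phi(r)$ to $\phi(z)$ never intersects $\phi(x)$. Thus, we have $d(r, z) < d(r, x) + d(x, z)$. Here, we used the assumption that there is no $y \in X \setminus \{x\}$ such that $d(x, w) = d(y, w)$ for all $w \in X$. ∎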
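Lemmas 6 and 7 together suggest a simple $O(n^2)$ procedure for finding the leaf objects, sketched below. The sketch assumes that the condition in Lemma 7 takes the form $d(r,z) < d(r,x) + d(x,z)$ for all $z \neq x$, which is how the two halves of the proof read, and that duplicate objects have already been removed. The function name `leaf_objects` and the tolerance parameter are assumptions for this example.

```python
def leaf_objects(d, tol=0):
    """Return the sorted list of leaf objects of a subtree distance d.

    Assumes no two objects have identical rows in d.
    """
    n = len(d)
    # Lemma 6: each endpoint of a farthest pair is a leaf object.
    r, _ = max(((x, y) for x in range(n) for y in range(n)),
               key=lambda pair: d[pair[0]][pair[1]])
    leaves = [r]
    for x in range(n):
        if x == r:
            continue
        # Assumed form of Lemma 7: strict inequality for every z != x.
        if all(d[r][z] + tol < d[r][x] + d[x][z]
               for z in range(n) if z != x):
            leaves.append(x)
    return sorted(leaves)

# Star with center c and leaves a, b, e at distances 1, 2, 3 from c.
# Objects 0, 1, 2, 3 are mapped to {a}, {b}, {e}, {c}.
star = [[0, 3, 4, 1],
        [3, 0, 5, 2],
        [4, 5, 0, 3],
        [1, 2, 3, 0]]
print(leaf_objects(star))  # [0, 1, 2]

# Path a - b - c; objects mapped to {a}, {c}, {a, b, c}.
path = [[0, 2, 0], [2, 0, 0], [0, 0, 0]]
print(leaf_objects(path))  # [0, 1]
```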

Since we can take a leaf object universally by Lemma 6, and the condition in Lemma 7 is described without specifying the underlying representation, we can conclude that the set of leaf objects is uniquely determined by $d$. Thus, we obtain the following corollary.

###### Corollary 8.

Any minimal representation of a subtree distance has the same leaf objects. ∎

Now, we observe that the dissimilarity $d|_L$ obtained by restricting $d$ to the set $L$ of leaf objects forms a tree metric because the leaf objects are mapped to singletons. Since the minimal representation of a tree metric is unique buneman1971recovery, any minimal representation of $d$ has the same topology as the minimal representation of $d|_L$.

The remaining issue is to show that each non-leaf object is uniquely located in the minimal representation of $d|_L$. This is clear because any connected subgraph in a tree is uniquely identified by the distances from the leaves. More precisely, we obtain the following explicit representation.

We first consider $T$ as a continuous object. We fix a leaf object $r \in L$. For each leaf object $x \in L \setminus \{r\}$, there exists a unique path from $\phi(r)$ to $\phi(x)$ in $T$. Let $I(r, a; x, b)$ be the interval on the path having a distance of at least $a$ from $\phi(r)$ and at least $b$ from $\phi(x)$, i.e.,

$$
I(r, a; x, b) = \{\, u \in \mathrm{path}(\phi(r), \phi(x)) : d_T(\phi(r), u) \ge a,\ d_T(\phi(x), u) \ge b \,\}. \tag{2.1}
$$

For $z \in X$, we denote by $\overline{\phi(z)}$ the subgraph of $T$ induced by $\phi(z)$. Note that both $I(r, a; x, b)$ and $\overline{\phi(z)}$ are continuous objects. By using these notations, we obtain the following.

###### Lemma 9.

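Let $d$ be a subtree distance and $L$ be the set of leaf objects. Let $(T, \phi)$ be a minimal representation of $d$. Fix a leaf object $r \in L$. Then, we have for each non-leaf object $z \in X \setminus L$

$$
\overline{\phi(z)} = \bigcup_{x \in L \setminus \{r\}} I(r, d(r,z); x, d(x,z)). \tag{2.2}
$$

###### Proof.

Since $\overline{\phi(z)}$ is a connected subgraph, the intersection of $\overline{\phi(z)}$ and the path from $\phi(r)$ to $\phi(x)$ is the interval $I(r, d(r,z); x, d(x,z))$. Since any tree is covered by the paths from a fixed leaf $\phi(r)$ to the other leaves $\phi(x)$ for $x \in L \setminus \{r\}$, we have

$$
\begin{aligned}
\overline{\phi(z)}
&= \Biggl( \bigcup_{x \in L \setminus \{r\}} \mathrm{path}(\phi(r), \phi(x)) \Biggr) \cap \overline{\phi(z)} \\
&= \bigcup_{x \in L \setminus \{r\}} \Bigl( \mathrm{path}(\phi(r), \phi(x)) \cap \overline{\phi(z)} \Bigr) \\
&= \bigcup_{x \in L \setminus \{r\}} I(r, d(r,z); x, d(x,z)).
\end{aligned} \tag{2.3}
$$

By placing vertices on the boundaries of $\overline{\phi(z)}$ for $z \in X \setminus L$ and then letting $\phi(z)$ be the set of vertices contained in $\overline{\phi(z)}$ for $z \in X \setminus L$, we obtain the minimal representation of $d$. This completes the proof of Theorem 2. Note that the number of vertices of the minimal representation is $O(n^2)$ since each $\overline{\phi(z)}$ has at most $n$ boundaries.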
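To make equation (2.2) concrete, the interval $I(r, d(r,z); x, d(x,z))$ can be expressed in coordinates along the path from $\phi(r)$ to $\phi(x)$: a point at distance $t$ from $\phi(r)$ belongs to it exactly when $d(r,z) \le t \le d(r,x) - d(x,z)$. The sketch below computes these coordinate ranges for one object $z$. The parameterization by distance from $\phi(r)$, the function name `intervals_for_object`, and the dictionary return format are assumptions for this example; embedding the ranges into an actual tree data structure is not shown.

```python
def intervals_for_object(d, leaves, r, z):
    """Coordinate ranges of the intervals in equation (2.2) for object z.

    Returns {x: (start, end)}, where positions are distances from phi(r)
    along the path from phi(r) to phi(x). Leaves whose path misses phi(z)
    give an empty interval and are omitted.
    """
    ranges = {}
    for x in leaves:
        if x == r:
            continue
        start = d[r][z]          # at least d(r, z) away from phi(r)
        end = d[r][x] - d[x][z]  # at least d(x, z) away from phi(x)
        if start <= end:
            ranges[x] = (start, end)
    return ranges

# Star with center c and leaves a, b, e at distances 1, 2, 3 from c.
# Objects 0, 1, 2, 3 are mapped to {a}, {b}, {e}, {c}; leaves are 0, 1, 2.
star = [[0, 3, 4, 1],
        [3, 0, 5, 2],
        [4, 5, 0, 3],
        [1, 2, 3, 0]]
# Object 3 sits at the single point at distance 2 from phi(1) on both paths.
print(intervals_for_object(star, [0, 1, 2], r=1, z=3))  # {0: (2, 2), 2: (2, 2)}

# Path a - b - c; object 2 is mapped to the whole path.
path = [[0, 2, 0], [2, 0, 0], [0, 0, 0]]
print(intervals_for_object(path, [0, 1], r=0, z=2))     # {1: (0, 2)}
```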

Now, we prove Theorem 3. The above proof of Theorem 2 is constructive, and it provides the following algorithm:

1. Identify the set $L$ of leaf objects by Lemmas 6 and 7.

2. Find the minimal representation of $d|_L$ by the existing algorithm.

3. Locate the non-leaf objects by Lemma 9.

We evaluate the time complexity of this algorithm. Step 1 takes $O(n^2)$ time for finding a leaf object and $O(n^2)$ time for finding the other leaf objects. Step 2 is performed in $O(n^2)$ time by using Culberson and Rudnicki’s algorithm culberson1989fast. Also, Step 3 is performed in $O(n^2)$ time by equation (2.2) since, for each $z \in X \setminus L$, it processes each edge at most once. Hence, Theorem 3 is proved.

## Acknowledgments

The authors thank Hiroshi Hirai and anonymous referees for helpful comments. This work was supported by the Japan Society for the Promotion of Science, KAKENHI Grant Number 15K00033.

## References

•  Kazutoshi Ando and Koki Sato. An algorithm for finding a representation of a subtree distance. Journal of Combinatorial Optimization, 36(3):742–762, 2018.
•  Peter Buneman. The recovery of trees from measures of dissimilarity. In D.G. Kendall and P. Tautu, editors, Mathematics in the Archeological and Historical Sciences, pages 387–395. Edinburgh University Press, 1971.
•  Joseph C. Culberson and Piotr Rudnicki. A fast algorithm for constructing trees from distance matrices. Information Processing Letters, 30(4):215–220, 1989.
•  Jotun J. Hein. An optimal algorithm to reconstruct trees from additive distance data. Bulletin of Mathematical Biology, 51(5):597–603, 1989.
•  Hiroshi Hirai. Characterization of the distance between subtrees of a tree by the associated tight span. Annals of Combinatorics, 10(1):111–128, 2006.
•  Naruya Saitou and Masatoshi Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular Biology and Evolution, 4(4):406–425, 1987.
•  Juraj Wiedermann. The complexity of lexicographic sorting and searching. In Proceedings of the 8th International Symposium on Mathematical Foundations of Computer Science, pages 517–522, 1979.
•  Edward O. Wiley and Bruce S. Lieberman. Phylogenetics: Theory and Practice of Phylogenetic Systematics. John Wiley & Sons, 2011.
•  K. A. Zaretskii. Constructing a tree on the basis of a set of distances between the hanging vertices. Uspekhi Matematicheskikh Nauk, 20(6):90–92, 1965.