Ranking top-k trees in tree-based phylogenetic networks

04/29/2019
by Momoko Hayamizu, et al.
Kyoto University

'Tree-based' phylogenetic networks proposed by Francis and Steel have attracted much attention from theoretical biologists in the last few years. At the heart of the definition of tree-based phylogenetic networks is the notion of 'support trees', about which there are numerous algorithmic problems that are important for evolutionary data analysis. Recently, Hayamizu (arXiv:1811.05849 [math.CO]) proved a structure theorem for tree-based phylogenetic networks and obtained linear-time and linear-delay algorithms for many basic problems on support trees, such as counting, optimisation, and enumeration. In the present paper, we consider the following fundamental problem in statistical data analysis: given a tree-based phylogenetic network N whose arcs are associated with probabilities, rank the top-k support trees of N by their likelihood values. We provide a linear-delay (and hence optimal) algorithm for the problem and thus reveal the interesting property of tree-based phylogenetic networks that ranking top-k support trees is as computationally easy as picking k arbitrary support trees.


1. Introduction

Although phylogenetic trees have been used as the standard model of evolution, phylogenetic networks have become popular amongst biologists as a tool to describe conflicting signals in data or uncertainty in evolutionary histories [4, 6, 9]. Therefore, when we wish to reconstruct the phylogenetic tree $T$ on a set $X$ of species from non-tree-like data, a natural idea would be to describe the data using a phylogenetic network $N$ on $X$ and then remove extra arcs to discover an embedding of $T$ inside $N$; such an embedding is called a ‘support tree’ of $N$ [6].

However, the above strategy only makes sense when $N$ is ‘tree-based’, namely, when $N$ is merely a tree with additional arcs [6], which is not always the case [12]. In [6], Francis and Steel provided a linear-time algorithm that finds a support tree of $N$ if $N$ is tree-based and reports that none exists otherwise. Another linear-time algorithm for this decision problem was obtained by Zhang in [13].

While Francis and Steel’s work was followed by many studies (e.g., [1, 3, 4, 5, 7, 11, 13]), Hayamizu’s recent work [8] significantly advanced our understanding of how tree-based networks could be useful in contemporary phylogenetic analysis. In fact, Hayamizu’s structure theorem has yielded a series of linear-time and linear-delay algorithms for many basic problems on support trees (e.g., counting, enumeration and optimisation), and has thus enabled various data analyses using tree-based phylogenetic networks (see [8] for details).

In the present paper, we consider a so-called ‘top-$k$ ranking problem’, with the aim of further facilitating the application of tree-based phylogenetic networks. The problem is as follows: given a tree-based phylogenetic network $N$ where each arc exists in the true evolutionary lineage with a given probability, list the top-$k$ support trees of $N$ in non-increasing order of their likelihood values. We note that this problem is an important generalisation of the top-$1$ ranking problem, which asks for a maximum likelihood support tree of $N$ and can be solved in linear time [8], since nearly optimal support trees can provide more biological insights than the maximum likelihood one alone.

At first glance, ranking the top-$k$ support trees may seem more difficult than picking $k$ arbitrary support trees, the latter of which is possible with linear delay [8]; however, in this paper, we provide a linear-delay (i.e., optimal) algorithm for the top-$k$ ranking problem and thus reveal that the two problems have the same time complexity, which is an interesting property of tree-based phylogenetic networks.

2. Preliminaries

Throughout this paper, $X$ represents a non-empty finite set of present-day species. All graphs considered here are finite, simple, directed acyclic graphs. For a graph $G$, $V(G)$ and $A(G)$ denote the sets of vertices and arcs of $G$, respectively. A graph $G$ is called a subgraph of a graph $H$ if both $V(G) \subseteq V(H)$ and $A(G) \subseteq A(H)$ hold, in which case we write $G \subseteq H$. When $G \subseteq H$ but $G \neq H$, $G$ is called a proper subgraph of $H$. When $G \subseteq H$ and $V(G) = V(H)$, $G$ is a spanning subgraph of $H$. Given a graph $G$ and a non-empty subset $A'$ of $A(G)$, $A'$ is said to induce the subgraph $G[A']$ of $G$, that is, the one whose arc-set is $A'$ and whose vertex-set consists of all ends of arcs in $A'$. For a graph $G$ with $|A(G)| \geq 1$ and a partition $\{A_1, \dots, A_d\}$ of $A(G)$, the collection $\{G[A_1], \dots, G[A_d]\}$ of arc-induced subgraphs of $G$ is called a decomposition of $G$. For an arc $a = (u, v) \in A(G)$, $u$ and $v$ are called the tail and head of $a$ and are denoted by $\mathrm{tail}(a)$ and $\mathrm{head}(a)$, respectively. For a vertex $v$ of a graph $G$, the in-degree of $v$ in $G$, denoted by $\deg^-_G(v)$, is defined to be the cardinality of the set $\{a \in A(G) \mid \mathrm{head}(a) = v\}$. The out-degree of $v$ in $G$, denoted by $\deg^+_G(v)$, is defined in a similar manner. For any graph $G$, a vertex $v \in V(G)$ with $(\deg^-_G(v), \deg^+_G(v)) = (1, 0)$ is called a leaf of $G$.

Definition 2.1.

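A rooted binary phylogenetic $X$-network is defined to be a finite simple directed acyclic graph $N$ with the following properties:

  1. $N$ has a unique vertex $\rho$ with in-degree $\deg^-_N(\rho) = 0$ and out-degree $\deg^+_N(\rho) \in \{1, 2\}$;

  2. $X$ is the set of leaves of $N$;

  3. for any $v \in V(N) \setminus (X \cup \{\rho\})$, $(\deg^-_N(v), \deg^+_N(v)) \in \{(1, 2), (2, 1)\}$ holds.

In Definition 2.1, the vertex $\rho$ is called the root of $N$, and a vertex $v$ with $(\deg^-_N(v), \deg^+_N(v)) = (2, 1)$ is called a reticulation vertex of $N$. When $N$ has no reticulation vertex, $N$ is called a rooted binary phylogenetic $X$-tree.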
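
As a rough illustration of Definition 2.1, the following Python sketch checks the degree conditions on a network given as a list of (tail, head) arcs. It assumes the conditions read as follows: a unique root of in-degree 0 and out-degree 1 or 2, leaves of in-degree 1 and out-degree 0, and every other vertex with (in-degree, out-degree) equal to (1, 2) or (2, 1). The function names and the toy network `ARCS` are hypothetical, and acyclicity and simplicity are assumed rather than checked.

```python
from collections import defaultdict


def vertex_degrees(arcs):
    """Return {vertex: (in_degree, out_degree)} for a list of (tail, head) arcs."""
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for tail, head in arcs:
        outdeg[tail] += 1
        indeg[head] += 1
    vertices = set(indeg) | set(outdeg)
    return {v: (indeg[v], outdeg[v]) for v in vertices}


def satisfies_degree_conditions(arcs, leaves):
    """Check the three degree conditions; acyclicity is assumed, not verified."""
    deg = vertex_degrees(arcs)
    roots = [v for v, (i, _) in deg.items() if i == 0]
    # Condition 1: a unique root with out-degree 1 or 2.
    if len(roots) != 1 or deg[roots[0]][1] not in (1, 2):
        return False
    # Condition 2: the leaf set is exactly the given label set.
    if {v for v, d in deg.items() if d == (1, 0)} != set(leaves):
        return False
    # Condition 3: every other vertex is a tree vertex or a reticulation.
    internal = set(deg) - set(leaves) - set(roots)
    return all(deg[v] in ((1, 2), (2, 1)) for v in internal)


def reticulations(arcs):
    """Vertices with in-degree 2 and out-degree 1, sorted for determinism."""
    return sorted(v for v, d in vertex_degrees(arcs).items() if d == (2, 1))


# Hypothetical toy network with leaves x, y, z and one reticulation h.
ARCS = [("r", "a"), ("a", "b"), ("a", "c"), ("b", "x"),
        ("b", "h"), ("c", "h"), ("c", "y"), ("h", "z")]
print(satisfies_degree_conditions(ARCS, {"x", "y", "z"}), reticulations(ARCS))
```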

Definition 2.2 ([6]).

If a rooted binary phylogenetic $X$-network $N$ has a spanning tree that can be obtained by inserting zero or more vertices into each arc of a rooted binary phylogenetic $X$-tree $T$, then $N$ is said to be tree-based and the spanning tree is called a support tree of $N$.

Theorem 2.3 ([6]).

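Let $N$ be a rooted binary phylogenetic $X$-network and let $S$ be a subset of $A(N)$. Then, the subgraph $N[S]$ of $N$ is a support tree of $N$ if and only if $S$ satisfies the following three conditions, in which case $S$ is called an ‘admissible’ arc-set of $N$. Moreover, there exists a one-to-one correspondence between support trees of $N$ and admissible arc-sets of $N$.

  1. $S$ contains all $(u, v) \in A(N)$ with $\deg^-_N(v) = 1$ or $\deg^+_N(u) = 1$.

  2. for any distinct $a_1, a_2 \in A(N)$ with $\mathrm{head}(a_1) = \mathrm{head}(a_2)$, exactly one of $\{a_1, a_2\}$ is in $S$.

  3. for any distinct $a_1, a_2 \in A(N)$ with $\mathrm{tail}(a_1) = \mathrm{tail}(a_2)$, at least one of $\{a_1, a_2\}$ is in $S$.

In this paper, as the conditions in Theorem 2.3 still make sense for any subgraph of $N$, we consider admissible arc-sets of subgraphs of $N$.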
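
The three conditions of Theorem 2.3 can be checked mechanically. The sketch below is one possible reading, assuming that condition 1 forces every arc whose head has in-degree 1 or whose tail has out-degree 1, condition 2 requires exactly one of two arcs sharing a head, and condition 3 requires at least one of two arcs sharing a tail. The function name `is_admissible` and the toy network are hypothetical.

```python
from collections import defaultdict


def is_admissible(arcs, subset):
    """Check the three admissibility conditions for `subset` within the graph `arcs`."""
    arcs = list(arcs)
    chosen = set(subset)
    if not chosen <= set(arcs):
        return False
    by_head = defaultdict(list)
    by_tail = defaultdict(list)
    for arc in arcs:
        tail, head = arc
        by_head[head].append(arc)
        by_tail[tail].append(arc)
    for arc in arcs:
        tail, head = arc
        # Condition 1: arcs into an in-degree-1 head or out of an
        # out-degree-1 tail must be kept.
        forced = len(by_head[head]) == 1 or len(by_tail[tail]) == 1
        if forced and arc not in chosen:
            return False
    # Condition 2: of two arcs sharing a head, exactly one is kept.
    for group in by_head.values():
        if len(group) == 2 and sum(a in chosen for a in group) != 1:
            return False
    # Condition 3: of two arcs sharing a tail, at least one is kept.
    for group in by_tail.values():
        if len(group) == 2 and not any(a in chosen for a in group):
            return False
    return True


# Hypothetical toy network: h is a reticulation with parents b and c.
ARCS = [("r", "a"), ("a", "b"), ("a", "c"), ("b", "x"),
        ("b", "h"), ("c", "h"), ("c", "y"), ("h", "z")]
drop_bh = [a for a in ARCS if a != ("b", "h")]
drop_ch = [a for a in ARCS if a != ("c", "h")]
print(is_admissible(ARCS, drop_bh), is_admissible(ARCS, drop_ch), is_admissible(ARCS, ARCS))
```

In this toy example, removing either reticulation arc gives an admissible arc-set, while keeping both violates condition 2.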

3. Known results: the structure of support trees

Here, we summarise without proofs the relevant material in [8]. A connected subgraph $Z$ of a tree-based phylogenetic $X$-network $N$ with $m := |A(Z)| \geq 1$ is called a zig-zag trail (in $N$) if there exists a permutation $(a_1, \dots, a_m)$ of $A(Z)$ such that for each $i \in [1, m-1]$, either $\mathrm{head}(a_i) = \mathrm{head}(a_{i+1})$ or $\mathrm{tail}(a_i) = \mathrm{tail}(a_{i+1})$ holds. Then, any zig-zag trail $Z$ in $N$ is specified by an alternating sequence of (not necessarily distinct) vertices and distinct arcs of $N$, such as $(v_0, (v_0, v_1), v_1, (v_2, v_1), v_2, (v_2, v_3), \dots, (v_m, v_{m-1}), v_m)$, which can be more concisely expressed as $v_0 > v_1 < v_2 > v_3 < \cdots > v_{m-1} < v_m$ or in reverse order. A zig-zag trail $Z$ in $N$ is said to be maximal if $N$ contains no zig-zag trail $Z'$ such that $Z$ is a proper subgraph of $Z'$. A maximal zig-zag trail $Z$ with even $m := |A(Z)|$ is called a crown if $Z$ can be written in the cyclic form $v_0 < v_1 > v_2 < \cdots < v_{m-1} > v_m = v_0$ and is called a fence otherwise. Furthermore, a fence $Z$ with odd $m$ is called an N-fence, in which case $Z$ can be expressed as $v_0 > v_1 < v_2 > \cdots < v_{m-1} > v_m$. A fence $Z$ with even $m$ is called an M-fence if it can be written in the form $v_0 < v_1 > v_2 < \cdots < v_{m-1} > v_m$, rather than $v_0 > v_1 < v_2 > \cdots > v_{m-1} < v_m$.

From now on, we represent a maximal zig-zag trail $Z$ by a sequence $(a_1, \dots, a_m)$ of the elements of $A(Z)$ that form the zig-zag trail in this order, assuming that no confusion arises. Then, we can encode an arbitrary arc-induced subgraph of $Z$ by an $m$-dimensional vector. For example, for an N-fence $Z = (a_1, a_2, a_3)$, the subgraph of $Z$ induced by the subset $\{a_1, a_3\}$ is specified by the vector $(1, 0, 1)$. With this notation, we can state Hayamizu’s structure theorem for tree-based phylogenetic networks, which gives an explicit characterisation of the family of all admissible arc-sets of $N$ as follows.

Theorem 3.1 ([8]).

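Any tree-based phylogenetic $X$-network $N$ is uniquely decomposed into maximal zig-zag trails $Z_1, \dots, Z_d$, each of which is a crown, M-fence or N-fence. Moreover, a subgraph $N[S]$ of $N$ is a support tree of $N$ if and only if $S \cap A(Z_i)$ is an admissible arc-set of $Z_i$ for any $i \in [1, d]$. Furthermore, the collection $\mathcal{S}(N)$ of support trees of $N$ is characterised by a direct product of families $\mathcal{S}(Z_i)$ of the admissible arc-sets of $Z_i$, namely, we have $\mathcal{S}(N) = \prod_{i=1}^{d} \mathcal{S}(Z_i)$ with

$$
\mathcal{S}(Z_i) =
\begin{cases}
\{(1, 0)^{m_i/2},\ (0, 1)^{m_i/2}\} & \text{if } Z_i \text{ is a crown;} \\
\{(1, (0, 1)^{(m_i - 1)/2})\} & \text{if } Z_i \text{ is an N-fence;} \\
\{(1, (0, 1)^{p}, (1, 0)^{q}, 1) \mid p, q \in \mathbb{Z}_{\geq 0},\ p + q = (m_i - 2)/2\} & \text{if } Z_i \text{ is an M-fence,}
\end{cases}
$$

where $m_i := |A(Z_i)|$.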
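
To make the direct-product characterisation concrete, the sketch below enumerates the admissible arc-sets of each kind of maximal zig-zag trail as 0-1 vectors and combines them with a Cartesian product. It assumes the families obtained by applying the three conditions of Theorem 2.3 along a trail: the two alternating vectors for a crown, the single vector 1(01)...(01) for an N-fence, and the vectors 1(01)^p(10)^q1 for an M-fence. The function names and the `(kind, m)` input format are hypothetical.

```python
from itertools import product


def admissible_vectors(kind, m):
    """List the admissible arc-sets of a maximal zig-zag trail with m arcs as 0-1 tuples."""
    if kind == "crown":
        if m % 2:
            raise ValueError("a crown has an even number of arcs")
        return [(1, 0) * (m // 2), (0, 1) * (m // 2)]
    if kind == "N-fence":
        if m % 2 == 0:
            raise ValueError("an N-fence has an odd number of arcs")
        return [(1,) + (0, 1) * ((m - 1) // 2)]
    if kind == "M-fence":
        if m % 2 or m < 2:
            raise ValueError("an M-fence has an even number of arcs")
        r = (m - 2) // 2
        # p blocks of (0, 1) followed by q = r - p blocks of (1, 0).
        return [(1,) + (0, 1) * p + (1, 0) * (r - p) + (1,) for p in range(r + 1)]
    raise ValueError(f"unknown trail kind: {kind}")


def support_tree_codes(trails):
    """Cartesian product of the local families, one component per trail."""
    return list(product(*(admissible_vectors(kind, m) for kind, m in trails)))


codes = support_tree_codes([("crown", 4), ("M-fence", 4), ("N-fence", 3)])
print(len(codes))
```

Under these assumptions an M-fence with $m$ arcs contributes $m/2$ choices, a crown two and an N-fence one, so the number of support trees is the product of these counts.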

4. Top-$k$ support tree ranking problem

Given a tree-based phylogenetic $X$-network $N$ where each arc is chosen with a given probability, we can assign a ranking number to each support tree of $N$ by its likelihood value. In principle, the top-$k$ support tree ranking problem for $N$ asks for an ordered set of $k$ support trees of $N$, listed in non-increasing order of likelihood, such that no other support tree of $N$ has a likelihood value greater than that of any listed tree. However, such a ranking is not unique in general, since there can be ‘ties’ in the collection of support trees of $N$ as well as in the family of admissible arc-sets of each maximal zig-zag trail in $N$. For convenience, we ensure the uniqueness of the ranking by using the lexicographical order on vectors as follows.

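Assume that $N$ is a tree-based phylogenetic $X$-network decomposed into maximal zig-zag trails $Z_1, \dots, Z_d$ as in Theorem 3.1 and that $Z$ is any of these maximal zig-zag trails. We define the local ranking for $Z$ to be the totally ordered set of admissible arc-sets of $Z$ in which elements are ordered by non-increasing likelihood value and ties are broken lexicographically. Note that the elements of this set are $|A(Z)|$-dimensional vectors and any two of them are comparable lexicographically. From now on, we identify the $j$-th element of the local ranking with its local ranking number $j$, so that each support tree of $N$ can be written as a $d$-dimensional vector of local ranking numbers. Then, these vectors have the same dimension again and so we can break ties lexicographically as before. Abusing the notation slightly, we call the resulting totally ordered set the support tree ranking (for $N$). For any $k$ not exceeding the number of support trees of $N$, the top-$k$ support tree ranking (for $N$) is defined to be the unique subsequence consisting of the first $k$ elements of the support tree ranking. Note that for any $k$, one can determine in $O(|A(N)|)$ time whether or not $k$ exceeds the number of support trees of $N$ [8].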

Problem 4.1.

Top-$k$ support tree ranking problem
Input: A tree-based phylogenetic $X$-network $N$ with associated probability and a positive integer $k$ not exceeding the number of support trees of $N$.
Output: The top-$k$ support tree ranking for $N$.
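
For small inputs, the problem statement can be checked against a brute-force baseline that enumerates every combination of local admissible arc-sets, scores it and sorts. This is only a reference sketch with exponential cost, not the linear-delay method of this paper. It assumes that the likelihood of a support tree is the product of the probabilities of its arcs and that ties are broken by the lexicographically smaller code; both are assumptions made for illustration, and all names are hypothetical.

```python
from itertools import product
from math import prod


def likelihood(vector, probs):
    # ASSUMPTION: likelihood = product of the probabilities of the kept arcs.
    return prod(p for bit, p in zip(vector, probs) if bit)


def brute_force_top_k(trails, k):
    """trails: list of (admissible_vectors, arc_probabilities) pairs, one per trail."""
    candidates = []
    for combo in product(*(vectors for vectors, _ in trails)):
        value = prod(likelihood(v, probs) for v, (_, probs) in zip(combo, trails))
        candidates.append((value, combo))
    # Non-increasing likelihood; ties broken lexicographically (assumed direction).
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return candidates[:k]


# Hypothetical input: one crown and one M-fence, four arcs each.
crown = ([(1, 0, 1, 0), (0, 1, 0, 1)], [0.5, 0.25, 0.5, 0.25])
m_fence = ([(1, 1, 0, 1), (1, 0, 1, 1)], [1.0, 0.5, 0.25, 1.0])
ranking = brute_force_top_k([crown, m_fence], 3)
for value, combo in ranking:
    print(value, combo)
```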

5. Results

As a preliminary step, we prove the following proposition about the local ranking.

Proposition 5.1.

For any maximal zig-zag trail $Z$ in a tree-based phylogenetic $X$-network $N$ with associated probability, the first element in the local ranking for $Z$ can be found in $O(|A(Z)|)$ time. Moreover, given the $j$-th element in the local ranking, one can find the $(j+1)$-th element in $O(|A(Z)|)$ time.

Proof.

One can check in time whether is a crown, N-fence or M-fence. In the case when is a crown or N-fence, the local ranking for is trivial to compute as holds by Theorem 3.1. Assume that is an M-fence with . Also, let with for each and let for each . Then, holds for each . As one can obtain both and in time, computing the likelihood values for all requires time. This completes the proof. ∎
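
The M-fence case can be illustrated with prefix and suffix products. The sketch below assumes that the admissible arc-sets of an M-fence with $m$ arcs are the vectors 1(01)^p(10)^q1 with p + q = (m - 2)/2, and that the likelihood of an arc-set is the product of the probabilities of its arcs; under these assumptions all likelihood values are obtained in time linear in $m$. The function name and the input format are hypothetical.

```python
def m_fence_likelihoods(probs):
    """Likelihoods of all admissible arc-sets of an M-fence, indexed by p = 0..r.

    probs[i] is the probability of the (i+1)-th arc along the fence, and the
    p-th admissible vector is (1, (0,1)^p, (1,0)^(r-p), 1) with r = (m-2)/2.
    """
    m = len(probs)
    if m < 2 or m % 2:
        raise ValueError("an M-fence has an even number of arcs")
    r = (m - 2) // 2
    # prefix[p]: product over the kept arcs of the p leading (0,1) blocks,
    # i.e. 0-based indices 2, 4, ..., 2p.
    prefix = [1.0] * (r + 1)
    for p in range(1, r + 1):
        prefix[p] = prefix[p - 1] * probs[2 * p]
    # suffix[p]: product over the kept arcs of the trailing (1,0) blocks,
    # i.e. 0-based indices 2p+1, 2p+3, ..., m-3.
    suffix = [1.0] * (r + 1)
    for p in range(r - 1, -1, -1):
        suffix[p] = suffix[p + 1] * probs[2 * p + 1]
    ends = probs[0] * probs[-1]  # both end arcs are always kept
    return [ends * prefix[p] * suffix[p] for p in range(r + 1)]


print(m_fence_likelihoods([0.5, 0.5, 0.25, 0.25, 0.5, 0.5]))
```

Each value is obtained from two table look-ups, so the whole list costs a constant number of passes over the arcs of the fence.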

We define and for each . Recalling , we see that is a linear extension of the partially ordered set (i.e., implies ), where is the usual component-wise order on vectors (e.g., if and only if and ). We also note that this requires each to be an order ideal of (i.e., for any , implies ). These arguments lead to the following proposition.

Proposition 5.2.

Let be the top- support tree ranking for a tree-based phylogenetic -network with associated probability and let be as defined above. Then, holds, and for each , there exists with .

Let be the unit vector such that -th component is one and the others are all zeros. Also, for each , let be the first index such that the -th component of is strictly greater than one and let . For example, gives . Then, we have the next lemma, which is illustrated in Figure 1.

Lemma 5.3.

Let be the support tree ranking for a tree-based phylogenetic -network with associated probability and let be a graph with and . Then, is a spanning tree of the Hasse diagram of such that is the root of and implies .

Proof.

It is clear that implies (and hence ). By construction, is a tree rooted at because holds and for each , there exists a unique element with . This completes the proof. ∎

Figure 1. An illustration of Lemma 5.3. The top left is a tree-based phylogenetic -network whose arcs are associated with probability. The top right shows the maximal zig-zag trails in with and the likelihood of each element of (). On the bottom is the spanning tree (shown in bold) of the Hasse diagram of .

In what follows, for any , we write to mean the least element of . For any , let and . Also, for any , let , where represents a unique element with . We note that both and are possible to occur.

Lemma 5.4.

Let be the graph as in Lemma 5.3 and let be a subset of that is recursively defined by

(1)

Then, for each , we have and for all .

Proof.

Let and for . We will show that holds for any , which completes the proof, since and for all .

For , we have

where we assume that . Note that contains . This implies

For any , we have because holds. We thus obtain

From Equation (1) and , the desired conclusion follows. ∎

We are in a position to give an algorithm for Problem 4.1. As illustrated in Table 1, the algorithm starts by setting and and then returns for each , where is iteratively updated using Equation (1).

Table 1. Application of the proposed algorithm to the input in Figure 1 ().

In order to analyse the running time of the above algorithm, let us review some basics of a priority queue, which is a data structure for maintaining objects that are prioritised by their associated values. In its most basic form, a priority queue supports the operations called Insert and Delete-min, where the former refers to adding a new object, and the latter to finding and deleting the object with the highest priority [2]. Implemented with a binary heap, each of these operations can be performed in $O(\log n)$ time, where $n$ denotes the number of elements in the priority queue [2].
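
As a small illustration of these two operations, the sketch below wraps Python's standard `heapq` module, which implements a binary min-heap on a list. The class name `MinPriorityQueue` is hypothetical; storing negated likelihood values as keys, as in the usage lines, is one simple way to make Delete-min return the most likely object first.

```python
import heapq


class MinPriorityQueue:
    """Binary-heap priority queue with Insert and Delete-min."""

    def __init__(self):
        self._heap = []

    def insert(self, key, item):
        # O(log n): append and sift up.
        heapq.heappush(self._heap, (key, item))

    def delete_min(self):
        # O(log n): remove and return the entry with the smallest key.
        return heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)


queue = MinPriorityQueue()
for likelihood_value, label in [(0.5, "a"), (0.9, "b"), (0.1, "c")]:
    queue.insert(-likelihood_value, label)  # negate so larger values come out first
first = queue.delete_min()
print(first, len(queue))
```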

Theorem 5.5.

The top-$k$ support tree ranking problem (Problem 4.1) can be solved with linear delay, and hence in $O(k|A(N)|)$ time.

Proof.

As Equation (1) implies that holds for any , holds for any . Then, if we keep the elements of each in a priority queue, time suffices to return and to delete from . Also, once and have been obtained, inserting the two elements requires time. We note that follows from . By Proposition 5.1, for each , one can compute in time, which equals time as is a decomposition of . Hence, our algorithm can return one after the other in such a way that the delay between two consecutive outputs is time. This completes the proof. ∎
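
The general idea of producing a ranking from a priority queue can be sketched with a standard best-first search over vectors of local ranking numbers. This is a simplified stand-in, not the update rule of Equation (1): it inserts up to $d$ candidates per output and recomputes each likelihood from scratch, so it does not achieve the delay bound proved above. It assumes that each local ranking is given as a non-increasing list of likelihood values and that the likelihood of a support tree is the product of its local values; the function name is hypothetical. Each vector other than the all-ones vector is generated from exactly one parent, obtained by decreasing its first component greater than one, so no duplicates enter the queue.

```python
import heapq
from math import prod


def top_k_codes(local_rankings, k):
    """Return up to k pairs (likelihood, code), best first.

    local_rankings[i] lists the likelihood values of the i-th trail's
    admissible arc-sets in non-increasing order; a code is a tuple of
    1-based local ranking numbers, one per trail.
    """
    d = len(local_rankings)

    def value(code):
        return prod(local_rankings[i][j - 1] for i, j in enumerate(code))

    root = (1,) * d
    heap = [(-value(root), root)]
    results = []
    while heap and len(results) < k:
        negated, code = heapq.heappop(heap)
        results.append((-negated, code))
        # First index whose component exceeds one (last index for the root).
        first = next((i for i, j in enumerate(code) if j > 1), d - 1)
        # Children: increase one component at an index not after `first`,
        # so that `code` is the unique parent of each child.
        for i in range(first + 1):
            if code[i] < len(local_rankings[i]):
                child = code[:i] + (code[i] + 1,) + code[i + 1:]
                heapq.heappush(heap, (-value(child), child))
    return results


ranking = top_k_codes([[0.25, 0.0625], [0.5, 0.25]], 3)
print(ranking)
```

Because every child has a likelihood no greater than its parent, the entry removed from the queue is always the best vector not yet output, and ties are resolved in favour of the lexicographically smaller code.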

Finally, we make two remarks. First, time is required to output distinct support trees of as each support tree has size . Therefore, the running time of our algorithm (as well as that of the enumeration algorithm in [8]) is , which guarantees the optimality of those algorithms. Second, as commonly in the literature (e.g.[10]), it would be natural to wonder about the time complexity of an analogue of Problem 4.1 that only asks for outputting a sequence of the differences between and ; however, we note that this problem still requires time because the size of each difference is . To illustrate this, consider a tree-based phylogenetic -network that is decomposed into maximal fences, each of which has only one admissible arc-set, and crowns, each of which has size . The difference between any two support trees has size , which equals if is a constant.

References

  • [1] M. Anaya, O. Anipchenko-Ulaj, A. Ashfaq, J. Chiu, M. Kaiser, M. S. Ohsawa, M. Owen, E. Pavlechko, K. St. John, S. Suleria, K. Thompson, and C. Yap, On determining if tree-based networks contain fixed trees, Bulletin of Mathematical Biology 78 (2016), no. 5, 961–969.
  • [2] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to algorithms, MIT press, 2009.
  • [3] M. Fischer, M. Galla, L. Herbst, Y. Long, and K. Wicke, Non-binary treebased unrooted phylogenetic networks and their relations to binary and rooted ones, arXiv:1810.06853 [q-bio.PE] (2018).
  • [4] A. Francis, K. T. Huber, and V. Moulton, Tree-based unrooted phylogenetic networks, Bulletin of mathematical biology 80 (2018), no. 2, 404–416.
  • [5] A. Francis, C. Semple, and M. Steel, New characterisations of tree-based networks and proximity measures, Advances in Applied Mathematics 93 (2018), 93–107.
  • [6] A. R. Francis and M. Steel, Which phylogenetic networks are merely trees with additional arcs?, Systematic Biology 64 (2015), no. 5, 768–777.
  • [7] M. Hayamizu, On the existence of infinitely many universal tree-based networks, Journal of Theoretical Biology 396 (2016), 204–206.
  • [8] M. Hayamizu, A structure theorem for tree-based phylogenetic networks, arXiv:1811.05849 [math.CO] (2018).
  • [9] D. H. Huson, R. Rupp, and C. Scornavacca, Phylogenetic networks: concepts, algorithms and applications, Cambridge University Press, 2010.
  • [10] S. Kapoor and H. Ramesh, Algorithms for enumerating all spanning trees of undirected and weighted graphs, SIAM Journal on Computing 24 (1995), no. 2, 247–265.
  • [11] J. C. Pons, C. Semple, and M. Steel, Tree-based networks: characterisations, metrics, and support trees, Journal of Mathematical Biology 78 (2019), no. 4, 899–918.
  • [12] L. van Iersel, Different topological restrictions of rooted phylogenetic networks. Which make biological sense?, http://phylonetworks.blogspot.nl/2013/03/different-topological-restrictions-of.html, 2013, Accessed: 2019-03-16.
  • [13] L. Zhang, On tree-based phylogenetic networks, Journal of Computational Biology 23 (2016), no. 7, 553–565.

Acknowledgement

The first author acknowledges support from JST PRESTO Grant Number JPMJPR16EB.