# A structural characterization of tree-based phylogenetic networks

Attempting to recognize a tree inside a network is a fundamental undertaking in evolutionary analysis. Therefore, the notion of tree-based phylogenetic networks, which was introduced by Francis and Steel, has attracted much attention from researchers in theoretical biology in the last few years. Tree-based networks can be viewed as a natural generalization of rooted binary phylogenetic trees because they are merely trees with additional arcs, and in defining those networks, a certain kind of spanning tree called a subdivision tree plays an essential role. In this paper, we provide a structural characterization of tree-based networks that furnishes efficient algorithms for solving the following problems in linear time (for enumeration, in linear delay): given a rooted binary phylogenetic network N, 1) determine whether or not N is tree-based and find a subdivision tree if one exists (decision/search problem); 2) compute the number of subdivision trees of N (counting problem); 3) list all subdivision trees of N (enumeration problem); and 4) find a subdivision tree that maximizes or minimizes a prescribed objective function (optimization problem). Our structural result settles numerous questions, including the complexity of the problem of counting subdivision trees that was left open in the paper of Francis and Steel, and also provides short proofs of different known results from a unifying point of view. The results in this paper still hold for a certain class of non-binary networks. Some applications and further research directions are also mentioned.

## 1. Introduction

Although phylogenetic networks are widely used to describe non-tree-like evolution or to represent conflicts in data or uncertainty in evolutionary histories (e.g., [2, 3, 7, 13]), phylogenetic trees are still regarded as a fundamental model of evolution for their ultimate simplicity. Therefore, it is an essential undertaking to try to recognize a tree within a phylogenetic network (e.g., [10, 15]), which gives a possible explanation for Francis and Steel’s philosophy behind their definition of “tree-based” phylogenetic networks [4] (see also [13]).

Intuitively, tree-based phylogenetic networks can be seen as a natural generalization of phylogenetic trees since they are merely trees with additional arcs [4]. In the last few years, tree-based networks have attracted much attention from theoretical biologists, and their mathematical and computational aspects have been actively studied (e.g., [1, 6, 8, 11, 16]). In this context, although we will provide formal definitions later, the notion of “subdivision trees” plays an essential role because tree-based networks can be defined as those having at least one subdivision tree.

In the theory of computational complexity, the most fundamental type of question concerns the time complexity of a decision/search problem, such as the problem of determining whether or not a given phylogenetic network is tree-based and finding a subdivision tree of it if one exists. In [4], Francis and Steel proved that this problem can be formulated as the 2-satisfiability (2-SAT) problem and provided a linear time algorithm for solving it. Then, in view of the fact that computing the number of satisfying solutions of 2-SAT is #P-complete [14], they conjectured that the problem of counting the number of subdivision trees might also be hard [4]. Besides this counting problem, there are several natural problems whose time complexity is still unknown, which are summarized below.

• Counting problem: given a rooted binary phylogenetic network $N$, count the number of subdivision trees of $N$. The time complexity of this problem was left open in [4]. This problem can be relevant to quantifying the complexity of phylogenetic networks, as networks having many spanning trees tend to be more complex than those with only a few. Although some results were obtained in [8, 11], no polynomial time algorithm has been provided so far. It is known that a similar counting problem is #P-complete [10].

• Enumeration problem: given a rooted binary phylogenetic network $N$, list all subdivision trees of $N$. Although the time complexity of this problem is exponential in the size of the input [4], this does not rule out the existence of a fast algorithm. In the usual context of algorithm theory, the complexity of enumeration problems is evaluated in terms of the size of both input and output. It is meaningful to analyze this problem more carefully because listing a designated number of solutions, rather than all, makes practical sense in many applications.

• Optimization problem: given a rooted binary phylogenetic network $N$ and a weighting of the arcs of $N$, find a subdivision tree of $N$ that maximizes (or minimizes) a prescribed objective function. From a statistical perspective, this problem can be interpreted as modeling the situation where a phylogenetic network and the probability of the arcs are given and we wish to estimate the best tree within the network to maximize the likelihood or log-likelihood.

In this paper, we will show that the above problems can be solved in linear time (for enumeration, in linear delay) as a consequence of a structural characterization of tree-based phylogenetic networks (Theorem 4.4). Our structural result has implications for problems other than the above; for instance, it implies that one can quickly list a designated number of solutions in non-increasing order according to the value of the objective function.

The remainder of this paper is organized as follows. We set up basic definitions and notation in Section 2, and give a more formal description of the above-mentioned problems in Section 3. In Section 4, we begin by introducing some concepts essential to our results and then prove a structure theorem for tree-based phylogenetic networks. With this structural result in hand, in Section 5, we provide efficient algorithms for solving the above series of problems and also prove several known results on tree-based networks from a unifying point of view. Section 6 indicates possible directions of further studies; in particular, we suggest a generalization of the present work by mentioning that the results in this paper still hold for a certain kind of non-binary networks. The Appendix contains a numerical example to demonstrate how our algorithm solves the above counting problem and how this can give insights into the complexity of phylogenetic networks.

## 2. Preliminaries

Throughout this paper, $X$ denotes a non-empty finite set, and the terms “graph” and “network” all refer to finite, simple, acyclic digraphs (directed graphs), which we now define. A digraph $G$ is an ordered pair $(V, A)$ of a set $V$ of vertices and a set $A$ of arcs (i.e., directed edges). Given a digraph $G$, we write $V(G)$ and $A(G)$ to represent the sets of vertices and arcs of $G$, respectively. If $V(G)$ and $A(G)$ are finite sets, then $G$ is said to be finite. We use the notation for an arc oriented from a vertex to a vertex , and also write and to mean and , respectively. A digraph is said to be simple if holds for any and holds for any with . A simple digraph is said to be acyclic if has no cycle, namely, there is no sequence of three or more elements of such that holds for each , with indices taken .

For graphs and , is called a subgraph of if both and hold, in which case we write . A subgraph is said to be proper if we have either or . A subgraph is said to be spanning if holds. Given a graph and a subset , is said to induce the subgraph of , where denotes a set of the heads and tails of all arcs in . Besides, given a graph with and a partition of , the collection is called a decomposition of , where a partition of a set is defined to be a collection of non-empty disjoint subsets of whose union is .

For a vertex of a digraph , the in-degree and out-degree of in , denoted by and , are defined to be the cardinalities of the sets and , respectively. Given an acyclic digraph , a vertex is called a leaf of if holds.

###### Definition 2.1.

Given a finite set , a rooted binary phylogenetic -network is defined to be any finite simple acyclic digraph that has the following properties:

1. there exists a unique vertex of with and ;

2. is the set of leaves of ;

3. each vertex satisfies .

In Definition 2.1, the vertex is called the root of , which can be interpreted as the origin of all species that are signified by the leaves of . In addition, we call a tree vertex of if holds, and a reticulation vertex of otherwise. In the case when has no reticulation vertex, is called a rooted binary phylogenetic -tree.

###### Definition 2.2 ([4]).

For any rooted binary phylogenetic -network , a subdivision tree of is defined to be a spanning tree of that can be obtained from a rooted binary phylogenetic -tree by inserting zero or more vertices into each arc of .

###### Definition 2.3 ([4]).

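Suppose $G$ is a subgraph of a rooted binary phylogenetic $X$-network. We say that a subset $S$ of $A(G)$ is admissible if $S$ satisfies the following conditions:

C0: $S$ contains every arc $(u, v)$ such that $u$ has out-degree 1 or $v$ has in-degree 1.

C1: for any two distinct arcs $a_1, a_2$ with the same head, exactly one of $a_1, a_2$ is in $S$.

C2: for any two distinct arcs $a_1, a_2$ with the same tail, at least one of $a_1, a_2$ is in $S$.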
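The three conditions can be checked mechanically. The following is a minimal sketch, assuming arcs are given as `(tail, head)` tuples and reading C0 as "arcs leaving an out-degree-1 vertex or entering an in-degree-1 vertex are forced", C1 as "exactly one of two arcs sharing a head", and C2 as "at least one of two arcs sharing a tail". The function name `is_admissible` and the toy network are illustrative and do not come from the paper.

```python
from collections import defaultdict

def is_admissible(arcs, chosen):
    """Check C0, C1, C2 for a set `chosen` of arcs of a binary network.

    Assumption: every vertex has in-degree and out-degree at most 2,
    and `chosen` is a subset of `arcs`.
    """
    chosen = set(chosen)
    by_tail, by_head = defaultdict(list), defaultdict(list)
    for u, v in arcs:
        by_tail[u].append((u, v))
        by_head[v].append((u, v))
    # C0: an arc is forced if its tail has out-degree 1 or its head has in-degree 1.
    for u, v in arcs:
        forced = len(by_tail[u]) == 1 or len(by_head[v]) == 1
        if forced and (u, v) not in chosen:
            return False
    # C1: of two arcs entering the same vertex, exactly one is chosen.
    for group in by_head.values():
        if len(group) == 2 and sum(a in chosen for a in group) != 1:
            return False
    # C2: of two arcs leaving the same vertex, at least one is chosen.
    for group in by_tail.values():
        if len(group) == 2 and not any(a in chosen for a in group):
            return False
    return True

# Toy network (hypothetical): one reticulation c with parents a and b.
toy = [("r", "a"), ("r", "b"), ("a", "c"), ("b", "c"),
       ("a", "x"), ("b", "y"), ("c", "z")]
print(is_admissible(toy, [a for a in toy if a != ("a", "c")]))
print(is_admissible(toy, toy))
```

In the toy network, dropping exactly one of the two arcs into `c` satisfies all three conditions, whereas keeping both violates C1.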

## 3. Motivating problems

Here, we provide the relevant background on “tree-based” phylogenetic networks in order to describe the problems to be addressed in this paper. We say that a rooted binary phylogenetic $X$-network $N$ is tree-based if $N$ has at least one subdivision tree (see [4] for their original definition). Intuitively, tree-based networks can be viewed as a natural extension of rooted binary phylogenetic $X$-trees because they are merely trees with additional arcs. In [4], Francis and Steel gave an algorithmic characterization of this class of networks. More precisely, they proved that the following decision/search problem can be formulated as the 2-SAT problem and, as a consequence of Theorem 3.1, described a linear time algorithm for finding a subdivision tree of $N$ if one exists. Thus, they have shown that the following decision/search problem can be solved in linear time.

###### Problem 1 ([4]).

Given a rooted binary phylogenetic -network , determine whether or not is a tree-based network on and find a subdivision tree of if there exists any.

###### Theorem 3.1 ([4]).

Given a rooted binary phylogenetic -network , any subdivision tree of is a subgraph of induced by an admissible subset of . Moreover, there exists a bijection between the families of admissible subsets of and of arc-sets of subdivision trees of .

As counting the number of satisfying solutions of 2-SAT is known to be #P-complete [14], a natural question arises as to whether the following counting problem, which was left open in [4], can be solved in polynomial time.

###### Problem 2 ([4]).

Given a rooted binary phylogenetic -network , count the number of subdivision trees of .

We will now describe some analogs to the above. The first question is whether there exists an efficient algorithm for the following enumeration problem.

###### Problem 3.

Given a rooted binary phylogenetic -network , list all subdivision trees of .

In general, the number of solutions of enumeration problems can be exponential in the size of the input or even infinite. In the usual context of algorithm theory, therefore, the time complexity of listing combinatorial structures has been analyzed in terms of both input size and output size. In particular, polynomial-delay algorithms [9], which generate all solutions one after another such that the time between the output of any two consecutive solutions is bounded by a polynomial function in the input size, are considered one of the most efficient classes of enumeration algorithms [5]. Those algorithms are fast indeed as their running time is linear with respect to the size of the output. Hence, even though the number of solutions of Problem 3 is exponential in the size of $N$ [4], we can say that the problem is tractable if there exists a polynomial-delay algorithm for solving Problem 3.

The next question is whether the following optimization problem can be solved in polynomial time in the size of $N$. Note that we can convert Problem 4 into a minimization problem by changing the sign, or use an objective function in product form by taking the exponential. Applications of this problem include the setting where, given a phylogenetic network $N$ and the probability of the arcs of $N$, we wish to estimate a subdivision tree of $N$ that maximizes the likelihood or log-likelihood.

###### Problem 4.

Given a rooted binary phylogenetic -network and an associated weighting function , find a subdivision tree of to maximize the value of the objective function .

The above fundamental problems give rise to interesting variations; for example, combining Problem 3 with Problem 4 formulates the following problem. The question is whether we can quickly generate a designated number of (sub)optimal solutions.

###### Problem 5.

Given a rooted binary phylogenetic -network , an associated weighting function , and , list subdivision trees in non-increasing order according to the value of the objective function .

## 4. Structure theorem for tree-based phylogenetic networks

In order to establish a framework to solve the above-mentioned problems, we provide a structural characterization of tree-based phylogenetic networks. For a rooted binary phylogenetic -network , we define a zig-zag trail in as a connected subgraph of with such that there exists a permutation of where either or holds for each . Any zig-zag trail in can be expressed by an alternating sequence of (not necessarily distinct) vertices and distinct arcs, such as ; however, we will more concisely represent above by writing or in reverse order. The notation may be also used when no confusion arises.

A zig-zag trail in is said to be maximal if contains no zig-zag trail such that is a proper subgraph of . Any maximal zig-zag trail in falls into one of the four types, which are defined as follows (see also Figure 1). A maximal zig-zag trail in with even is called a crown if can be written in the form ; otherwise, it is called a fence. Furthermore, a fence with odd is called an N-fence, which can be expressed as . Also, a fence with even is called a W-fence if it can be written as while it is called an M-fence if it can be written as . For any fence , its vertices and on both ends are called the endpoints of .

###### Remark 4.1.

Although the terms fence and crown in the theory of partially ordered sets usually refer to those that can be represented using bipartite graphs as in Figure 1 (the interested reader is referred to [12]), in our terminology, such an atypical M-fence as in Figure 2 is also allowed.

###### Lemma 4.2.

For any rooted binary phylogenetic -network , there exists a unique decomposition of such that each is a maximal zig-zag trail in .

###### Proof.

The proof is divided into two parts. We first claim that holds for any with . Indeed, any zig-zag trails and in have a common arc if and only if either or holds because would contain a vertex of in-degree or out-degree otherwise. Then, the claim follows from the maximality of and . Next, we will prove that for any , there exists a unique element of with . For any , there exists an obvious zig-zag trail in , that is, . Then, because is finite, there exists maximal one with . By using the first claim, we can conclude that such is uniquely determined by . This completes the proof. ∎

###### Lemma 4.3.

Let be a rooted binary phylogenetic -network and be the maximal zig-zag trail decomposition of . Then, is an admissible subset of if and only if is an admissible subset of for each .

###### Proof.

Our goal is to prove that satisfies the conditions C0, C1, and C2 in Definition 2.3 if and only if for each , the following C0, C1, and C2 hold:

C0:

contains all with or ;

C1:

for any with , exactly one of is in ;

C2:

for any with , at least one of is in .

If satisfies (or ), then there exists a unique element of with and (or ) by Lemma 4.2. The converse also holds as would not be maximal otherwise. Lemma 4.2 also implies that is partitioned into (note that no element of is empty). Thus, we can assert that satisfies C0 if and only if C0 holds for each . By similar reasoning, we can deduce that satisfies if and only if there exists a unique element of such that has the same property. Recalling that is a partition of , we have for any and any with . Hence, satisfies C1 if and only if C1 holds for each . The same arguments derive the desired conclusion regarding C2 and C2. This completes the proof. ∎

From now on, we consider an ordered set of maximal zig-zag trails in so that we can identify a subgraph of with a direct product . In addition, for each , we represent the set using a sequence of the elements of that form the zig-zag trail in this order, where . This allows us to encode arbitrary subset of using a 0-1 sequence of length . For example, given , we can specify the subset by the sequence . Using this notation, for each , we define a family of subsets of as follows.

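$$
S(Z_i) := \begin{cases}
\{\langle (10)^{m_i/2} \rangle,\ \langle (01)^{m_i/2} \rangle\} & \text{if } Z_i \text{ is a crown;} \\
\{\langle 1(01)^{(m_i-1)/2} \rangle\} & \text{if } Z_i \text{ is an N-fence;} \\
\{\langle 1(01)^{p}(10)^{q}1 \rangle \mid p, q \in \mathbb{Z}_{\geq 0},\ p + q = (m_i-2)/2\} & \text{if } Z_i \text{ is an M-fence.}
\end{cases} \tag{1}
$$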

Note that the above sequence representation of the subsets in does not depend on the ordering of the arcs of by virtue of the symmetric structure. For example, when is an N-fence, the sequence and its reverse ordering are identical.
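To make the sequence notation in (1) concrete, here is a small sketch that writes out the 0-1 strings for a trail of a given type and length. The function name `admissible_sequences` and the type labels `"crown"`, `"N"`, `"M"`, `"W"` are illustrative choices. Returning an empty list for a W-fence is not part of (1) itself; it is an assumption that mirrors the proof of Theorem 4.4, where a W-fence has no admissible subset.

```python
def admissible_sequences(kind, m):
    """Return the 0-1 strings of length m listed in Eq. (1) for one trail.

    `kind` is one of "crown", "N", "M", "W" (labels chosen for this sketch).
    """
    if kind == "crown":            # m even: two alternating patterns
        return ["10" * (m // 2), "01" * (m // 2)]
    if kind == "N":                # m odd: a single forced pattern
        return ["1" + "01" * ((m - 1) // 2)]
    if kind == "M":                # m even: one pattern per split p + q = (m - 2) / 2
        k = (m - 2) // 2
        return ["1" + "01" * p + "10" * (k - p) + "1" for p in range(k + 1)]
    if kind == "W":                # assumption: no admissible subset exists
        return []
    raise ValueError(f"unknown trail type: {kind}")

print(admissible_sequences("M", 4))      # ['1101', '1011']
print(admissible_sequences("N", 3))      # ['101']
print(admissible_sequences("crown", 4))  # ['1010', '0101']
```

For an M-fence the list has $m_i/2$ entries, one for each position at which the pattern switches from `01` blocks to `10` blocks.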

###### Theorem 4.4 (Structure theorem for tree-based phylogenetic networks).

Let $N$ be a rooted binary phylogenetic $X$-network and $\{Z_1, \dots, Z_\ell\}$ be a decomposition of $N$ where each $Z_i$ is a maximal zig-zag trail in $N$. Then, $N$ is a tree-based network on $X$ if and only if no element $Z_i$ is a W-fence. In this case, the collection $T$ of subdivision trees of $N$ is characterized by

$$
T = \prod_{i \in [1, \ell]} S(Z_i), \tag{2}
$$

where $S(Z_i)$ is defined in (1).

###### Proof.

We first recall Theorem 3.1. By Lemma 4.2 and Lemma 4.3, one can produce every admissible subset of by choosing each independently. In what follows, we consider a maximal zig-zag trail with . We enumerate all 0-1 sequences corresponding to the admissible subsets of in each of the following four cases (see also Figure 4).

• When is a crown , the condition C0 does not apply. Repeated application of the conditions C1 and C2 derives the only solution from . Similarly, implies . This proves that a family of all admissible subsets of is given by .

• When is an N-fence , one of its endpoints is a reticulation vertex of . Then, the condition C0 gives , which implies as in the previous case. This proves that is the only admissible subset of .

• When is a W-fence , both and are reticulation vertices of , so the constraint C0 gives and again. Similarly to the above, implies while implies according to C1 and C2. However, this means that no admissible subset of exists because is even.

• When is an M-fence , we have and again but the other values are left undetermined. We claim that a family of all admissible subsets of with is given by . The proof is by induction on the length (recall that is even). The assertion is trivial for . We consider the two cases according to the value of in the sequence of length . When holds, is the only admissible subset of having this form. When holds, this only implies . By the induction hypothesis, the family of admissible subsets having this form consists of the sequences with , , and . This proves the claim.

This completes the proof. ∎

## 5. Algorithmic implications and connection to some known results

As we will now explain, Theorem 4.4 furnishes fast algorithms for the problems described in Section 3 and also allows us to prove different known results on tree-based phylogenetic networks from a unifying point of view.

###### Proposition 5.1.

For any rooted binary phylogenetic -network , one can obtain the maximal zig-zag trail decomposition of in time.

###### Proof.

As described in Algorithm 1, the above decomposition can be obtained by visiting each arc of exactly once, which requires time. This completes the proof. ∎
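Algorithm 1 is not reproduced in this copy, so the following is only a sketch of one way to obtain such a decomposition, not the author's procedure. It assumes arcs are `(tail, head)` tuples and that every vertex has in-degree and out-degree at most 2. Under that assumption each arc has at most one other arc sharing its head and at most one sharing its tail, so a trail can be followed by alternating between the two. The names `decompose`, `sibling`, and `walk` are hypothetical.

```python
from collections import defaultdict

def decompose(arcs):
    """Split arcs into maximal zig-zag trails; return a list of (kind, trail)."""
    by_tail, by_head = defaultdict(list), defaultdict(list)
    for a in arcs:
        by_tail[a[0]].append(a)
        by_head[a[1]].append(a)

    def sibling(a, via_head):
        # The other arc sharing a's head (or tail), if any.
        group = by_head[a[1]] if via_head else by_tail[a[0]]
        return next((b for b in group if b != a), None)

    def walk(start, via_head):
        # Follow the trail, alternating shared heads and shared tails.
        trail, a = [start], start
        while True:
            b = sibling(a, via_head)
            if b is None or b == start:
                return trail
            trail.append(b)
            a, via_head = b, not via_head

    seen, result = set(), []
    # Fences: start at an arc that has a free side (no sibling there).
    for a in arcs:
        no_tail_sib = sibling(a, False) is None
        no_head_sib = sibling(a, True) is None
        if a in seen or not (no_tail_sib or no_head_sib):
            continue
        trail = walk(a, via_head=no_tail_sib)
        if len(trail) % 2 == 1:
            kind = "N"
        else:
            # Free tail at the start means both endpoints have out-degree 1.
            kind = "W" if no_tail_sib else "M"
        seen.update(trail)
        result.append((kind, trail))
    # Crowns: every remaining arc lies on a closed alternating trail.
    for a in arcs:
        if a not in seen:
            trail = walk(a, via_head=True)
            seen.update(trail)
            result.append(("crown", trail))
    return result

toy = [("r", "a"), ("r", "b"), ("a", "c"), ("b", "c"),
       ("a", "x"), ("b", "y"), ("c", "z")]
for kind, trail in decompose(toy):
    print(kind, trail)
```

Each arc is appended to exactly one trail and each sibling lookup inspects at most two arcs, so the sketch runs in time linear in the number of arcs under the degree assumption above.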

###### Corollary 5.2.

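Let $N$ be a rooted binary phylogenetic $X$-network and $\{Z_1, \dots, Z_\ell\}$ be the maximal zig-zag trail decomposition of $N$. Then, the number of subdivision trees of $N$ equals $\prod_{i=1}^{\ell} \alpha(Z_i)$, where

$$
\alpha(Z_i) = \begin{cases}
0 & \text{if } Z_i \text{ is a W-fence;} \\
1 & \text{if } Z_i \text{ is an N-fence;} \\
2 & \text{if } Z_i \text{ is a crown;} \\
|A(Z_i)|/2 & \text{if } Z_i \text{ is an M-fence.}
\end{cases} \tag{3}
$$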

Proposition 5.1 and Corollary 5.2 yield the following linear time algorithm for Problem 2. It counts the number of subdivision trees of by first computing the maximal zig-zag trail decomposition of and then determining the number of admissible subsets of based on the structure of each . As we will demonstrate in the Appendix, the number has implications for measuring the complexity of .
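Once the decomposition is known, the counting step reduces to a product of the factors in (3). The sketch below assumes the decomposition has already been summarized as a list of `(kind, number_of_arcs)` pairs; this input format and the names `alpha` and `count_subdivision_trees` are illustrative.

```python
from math import prod

def alpha(kind, m):
    """Factor contributed by one maximal zig-zag trail with m arcs, as in Eq. (3)."""
    if kind == "W":
        return 0
    if kind == "N":
        return 1
    if kind == "crown":
        return 2
    if kind == "M":
        return m // 2
    raise ValueError(f"unknown trail type: {kind}")

def count_subdivision_trees(trails):
    """Multiply the per-trail factors; the result is 0 exactly when a W-fence occurs."""
    return prod(alpha(kind, m) for kind, m in trails)

# Hypothetical decomposition: two M-fences (2 and 4 arcs) and one single-arc N-fence.
print(count_subdivision_trees([("M", 2), ("M", 4), ("N", 1)]))  # 2
print(count_subdivision_trees([("M", 4), ("W", 2), ("N", 1)]))  # 0
```

A zero result doubles as the answer to the decision problem: the network is tree-based precisely when the product is positive.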

Although our work is distinguished from previous studies (e.g., [4, 8, 11, 16]) by our structural result, the above counting algorithm can be viewed as a generalization of Francis and Steel’s algorithm [4] for Problem 1, which makes it evident that the problem of determining whether or not $N$ is tree-based can be solved in linear time. Besides, Corollary 5.2 proves that $N$ is tree-based if and only if no maximal zig-zag trail in $N$ is a W-fence (cf. [8, 16]) and also clarifies what factors of $N$ determine the number of subdivision trees of $N$ (cf. [8, 11]).

As the next corollary states, the problem of generating all solutions (Problem 3) can be easily solved in the same manner.

###### Corollary 5.3.

For any rooted binary phylogenetic -network , the number of subdivision trees of can be counted in time. Moreover, it is possible to list all subdivision trees of in linear delay.

###### Proof.

What remains unclear is the second statement. We may assume that holds. Consider the algorithm that first decomposes $N$ into maximal zig-zag trails and then generates all elements in the solution set . By Theorem 4.4, the elements of $S(Z_i)$ can be generated in time if $Z_i$ is an M-fence, and in time otherwise. Then, each of the following procedures can be performed in time: to find one of the solutions; to check whether there exists another solution that has not been found yet; to produce the next solution if one exists. This completes the proof. ∎
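The enumeration idea in this proof can be illustrated with a lazy generator over the product in (2). This is a sketch under assumptions: `trails` is a list of `(kind, ordered_arcs)` pairs whose arcs appear in trail order, the helper `sequences` restates (1) with an empty list for W-fences, and `itertools.product` stands in for the bookkeeping that produces the next solution. None of these names come from the paper.

```python
from itertools import product

def sequences(kind, m):
    """0-1 strings of Eq. (1); W-fences have none (assumption from Theorem 4.4)."""
    if kind == "crown":
        return ["10" * (m // 2), "01" * (m // 2)]
    if kind == "N":
        return ["1" + "01" * ((m - 1) // 2)]
    if kind == "M":
        k = (m - 2) // 2
        return ["1" + "01" * p + "10" * (k - p) + "1" for p in range(k + 1)]
    return []

def subdivision_trees(trails):
    """Lazily yield the arc set of each subdivision tree, one at a time."""
    options = [sequences(kind, len(arcs)) for kind, arcs in trails]
    for combo in product(*options):
        yield frozenset(
            arc
            for (_, arcs), bits in zip(trails, combo)
            for arc, bit in zip(arcs, bits)
            if bit == "1"
        )

# Hypothetical decomposition of a network with one reticulation c.
trails = [
    ("M", [("r", "a"), ("r", "b")]),
    ("M", [("a", "x"), ("a", "c"), ("b", "c"), ("b", "y")]),
    ("N", [("c", "z")]),
]
for tree in subdivision_trees(trails):
    print(sorted(tree))
```

Because the generator is lazy, a caller can stop after a designated number of solutions instead of materializing the whole list.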

Likewise, the optimization problem (Problem 4) and its variation (Problem 5) can be solved in linear time, and so we have the following two corollaries. Note that we can turn it into a minimization problem or define an objective function in the product form as in Section 3.

###### Corollary 5.4.

For any rooted binary phylogenetic -network and any associated weighting function , a subdivision tree of to maximize the objective function can be found in time.

###### Corollary 5.5.

For any rooted binary phylogenetic -network , any associated weighting function , and any , it is possible in time to list subdivision trees in non-increasing order according to the value of the objective function .
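Because the trails can be chosen independently, the optimization in Corollary 5.4 can be carried out trail by trail. The sketch below assumes an additive objective (the sum of the weights of the chosen arcs), since the exact objective function is not legible in this copy. It takes each trail as a `(kind, weights)` pair with weights listed in trail order. For an M-fence it scans the switch position `p` of (1) with a running total rather than scoring every sequence from scratch. All names are hypothetical.

```python
def score(bits, weights):
    """Sum of the weights of the arcs whose bit is 1."""
    return sum(w for b, w in zip(bits, weights) if b == "1")

def best_in_trail(kind, weights):
    """Return (value, bits) of a best admissible choice, or None for a W-fence."""
    m = len(weights)
    if kind == "W":
        return None
    if kind == "N":
        bits = "1" + "01" * ((m - 1) // 2)
        return score(bits, weights), bits
    if kind == "crown":
        candidates = ["10" * (m // 2), "01" * (m // 2)]
        return max((score(b, weights), b) for b in candidates)
    # M-fence: start with p = 0, i.e. the pattern 1 (10)^k 1.
    k = (m - 2) // 2
    value = weights[0] + weights[-1] + sum(weights[2 * j + 1] for j in range(k))
    best_value, best_p = value, 0
    for p in range(1, k + 1):
        j = p - 1
        # Moving the switch past pair j swaps arc 2j+1 for arc 2j+2.
        value += weights[2 * j + 2] - weights[2 * j + 1]
        if value > best_value:
            best_value, best_p = value, p
    return best_value, "1" + "01" * best_p + "10" * (k - best_p) + "1"

def best_subdivision_tree(trails):
    """Combine the per-trail optima; None means the network is not tree-based."""
    total, choice = 0, []
    for kind, weights in trails:
        result = best_in_trail(kind, weights)
        if result is None:
            return None
        total += result[0]
        choice.append(result[1])
    return total, choice

print(best_subdivision_tree([("M", [1, 1]), ("M", [5, 2, 9, 5]), ("N", [3])]))
# (24, ['11', '1011', '1'])
```

Negating the weights turns the same routine into a minimizer, and taking logarithms of arc probabilities beforehand corresponds to the log-likelihood setting mentioned in Section 3.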

## 6. Further research directions

We mention an open problem closely related to Problem 2 and also suggest another possible direction of future work.

### 6.1. Time complexity of counting base trees

Given a subdivision tree of a rooted binary phylogenetic -network , such a rooted binary phylogenetic -tree as in Definition 2.2 is called a base tree of  [4]. The time complexity of the following problem is still unknown.

###### Problem 6 ([4]).

Given a tree-based phylogenetic network , count the number of base trees of .

The tree-based network in Figure 5 strikingly demonstrates the difference between Problem 2 and Problem 6. Given this network as input , our counting algorithm returns although holds as it virtually contains only one phylogenetic tree. In contrast to being computable in time, it might be hard to determine efficiently for all . Even if so, it would be meaningful to consider the relationship between and towards the development of useful criteria for analyzing the complexity of phylogenetic networks.

### 6.2. Structural properties of non-binary phylogenetic networks

We note that the results in this paper still hold for any rooted phylogenetic network such that each vertex satisfies both and . This means that we can discuss the tree-basedness of non-binary phylogenetic networks that may have a vertex as in Figure 6. Extensions of the present work might lead to an interesting avenue of research other than the one studied in [8].

## References

• [1] M. Anaya, O. Anipchenko-Ulaj, A. Ashfaq, J. Chiu, M. Kaiser, M. S. Ohsawa, M. Owen, E. Pavlechko, K. St. John, S. Suleria, K. Thompson, and C. Yap, On determining if tree-based networks contain fixed trees, Bulletin of Mathematical Biology 78 (2016), no. 5, 961–969.
• [2] D. Bryant and V. Moulton, Neighbor-net: An agglomerative method for the construction of phylogenetic networks, Molecular Biology and Evolution 21 (2004), no. 2, 255–265.
• [3] D. H. Huson and D. Bryant, SplitsTree4 V4.14.6 (2017-09-26)
• [4] A. R. Francis and M. Steel, Which phylogenetic networks are merely trees with additional arcs?, Systematic Biology 64 (2015), no. 5, 768–777.
• [5] L. A. Goldberg, Efficient algorithms for listing combinatorial structures, vol. 5, Cambridge University Press, 2009.
• [6] M. Hayamizu, On the existence of infinitely many universal tree-based networks, Journal of Theoretical Biology 396 (2016), 204–206.
• [7] D. H. Huson, R. Rupp, and C. Scornavacca, Phylogenetic networks: concepts, algorithms and applications, Cambridge University Press, 2010.
• [8] L. Jetten, Characterising tree-based phylogenetic networks (karakterisatie van fylogenetische netwerken die een boom als basis hebben), BSc thesis, Delft University of Technology, November 2015.
• [9] D. S. Johnson, M. Yannakakis, and C. H. Papadimitriou, On generating all maximal independent sets, Information Processing Letters 27 (1988), no. 3, 119–123.
• [10] S. Linz, K. St. John, and C. Semple, Counting trees in a phylogenetic network is #P-complete, SIAM Journal on Computing 42 (2013), no. 4, 1768–1776.
• [11] J. C. Pons, C. Semple, and M. Steel, Tree-based networks: characterisations, metrics, and support trees, Journal of Mathematical Biology (2018).
• [12] B. Schröder, Ordered sets: An introduction with connections from combinatorics to topology 2nd ed., Birkhäuser, 2016.
• [13] M. Steel, Phylogeny: Discrete and random processes in evolution, SIAM, 2016.
• [14] L. G. Valiant, The complexity of enumeration and reliability problems, SIAM Journal on Computing 8 (1979), no. 3, 410–421.
• [15] L. van Iersel, C. Semple, and M. Steel, Locating a tree in a phylogenetic network, Information Processing Letters 110 (2010), no. 23, 1037–1043.
• [16] L. Zhang, On tree-based phylogenetic networks, Journal of Computational Biology 23 (2016), no. 7, 553–565.

## Acknowledgment

This work was supported by JST PRESTO Grant Number JPMJPR16EB. The author thanks Kazuhisa Makino for improving an earlier version of this paper particularly by drawing the author’s attention to Problem 4 and Problem 5.

## Appendix: numerical example

We demonstrate how to count the number of subdivision trees of a rooted binary phylogenetic -network and how this can be useful to evaluate the complexity of . Given the network in Figure 7, our counting algorithm starts by decomposing into 21 maximal N-fences consisting of a single arc and 7 maximal M-fences and then returns . By comparing this output with the trivial upper bound , where denotes the number of reticulation vertices of , we see that it is meaningful to compute the exact value of . Although the number may seem huge, it is smaller than the number of rooted binary phylogenetic -trees that is given by . In other words, does not have adequate complexity in order to cover all rooted binary phylogenetic -trees. Thus, the number can be used as a quantitative measure for the complexity of , which may have implications for statistical model selection in evolutionary data analysis.
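The kind of comparison made in this example can be sketched numerically. The specific values for the network in Figure 7 are not legible in this copy, so the code below uses a small hypothetical input instead. It assumes that the trivial upper bound is $2^r$ for $r$ reticulation vertices (each reticulation keeps one of its two incoming arcs) and uses the standard count $(2n-3)!!$ of rooted binary phylogenetic trees on $n \geq 2$ leaves. The function names are illustrative.

```python
from math import prod

def num_rooted_binary_trees(n_leaves):
    """(2n - 3)!! = 1 * 3 * 5 * ... * (2n - 3); equals 1 for n = 1 or n = 2."""
    return prod(range(1, 2 * n_leaves - 2, 2))

def compare(count, n_leaves, n_reticulations):
    """Place a subdivision-tree count next to the two reference quantities."""
    return {
        "subdivision_trees": count,
        "upper_bound_2^r": 2 ** n_reticulations,  # assumed form of the trivial bound
        "rooted_binary_trees": num_rooted_binary_trees(n_leaves),
    }

# Hypothetical toy values: 2 subdivision trees, 3 leaves, 1 reticulation.
print(compare(2, 3, 1))
# {'subdivision_trees': 2, 'upper_bound_2^r': 2, 'rooted_binary_trees': 3}
```

When the first number is smaller than the third, the network cannot contain every rooted binary phylogenetic tree on its leaf set, which is the observation the appendix draws from its own figures.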