Although phylogenetic networks are widely used to describe non-tree-like evolution or to represent conflicts in data or uncertainty in evolutionary histories (e.g., [2, 3, 8, 15]), phylogenetic trees are still regarded as a fundamental model of evolution for their ultimate simplicity. Therefore, it is an essential undertaking to try to recognize a tree within a phylogenetic network (e.g., [12, 17]), which gives a possible explanation for Francis and Steel’s philosophy behind their definition of “tree-based” phylogenetic networks  (see also ).
Intuitively, tree-based phylogenetic networks can be seen as a natural generalization of phylogenetic trees since they are merely trees with additional arcs [5]. In the last few years, tree-based networks have attracted much attention from theoretical biologists, and their mathematical and computational aspects have been actively studied (e.g., [1, 7, 10, 13, 18]). Although formal definitions are deferred to a later section, the notion of “subdivision trees” plays an essential role in this context because tree-based networks can be defined as those having at least one subdivision tree.
In the theory of computational complexity, the most fundamental type of question concerns the time complexity of a decision/search problem, such as the problem of determining whether or not a given phylogenetic network is tree-based and, if so, finding one of its subdivision trees. In [5], Francis and Steel proved that this problem can be formulated as the 2-satisfiability (2-SAT) problem and provided a linear time algorithm for solving it. Then, in view of the fact that counting the satisfying solutions of 2-SAT is #P-complete [16], they conjectured that the problem of counting all subdivision trees of a network might also be hard [5].
Including the above counting problem, below we pose a series of computational problems surrounding subdivision trees, with the overarching aim to explore new applications of tree-based phylogenetic networks.
Counting problem: given a rooted binary phylogenetic network, count its subdivision trees. In contrast to a similar tree-counting problem that is #P-complete [12], this problem can be solved in polynomial time, as indicated by the formulae for this number that were obtained in [10, 13]. This counting problem has an interesting connection to data analysis for quantifying the complexity of phylogenetic networks, because networks having many spanning trees tend to be more complex than those with only a few. Thus, it is meaningful to analyze the time complexity of this problem in more detail and to develop a fast counting algorithm.
Enumeration problem: given a tree-based rooted binary phylogenetic network, list all of its subdivision trees. The time complexity of this problem is clearly exponential in the size of the network, since the number of subdivision trees can itself be exponential, but this does not rule out the existence of a fast algorithm. Indeed, we need a fuller complexity analysis because, in the usual context of algorithm theory, the time complexity of enumeration problems is evaluated in terms of the size of both the input and the output. A fast enumeration algorithm is also important because listing a designated number of solutions, rather than all of them, enables statistical applications such as generating subdivision trees uniformly at random.
Optimization problem: given a tree-based rooted binary phylogenetic network together with a non-negative weighting on its arcs, find a subdivision tree that maximizes (or minimizes) the value of a prescribed objective function. From a statistical perspective, this problem models the situation where a phylogenetic network and the probabilities of its arcs are given and we wish to estimate the best-fit tree inside the network by maximizing the likelihood or log-likelihood. An obvious algorithm computes the value of the objective function for every subdivision tree, which requires exponential time in the size of the network. The question is whether an optimal solution can be obtained without such an exhaustive search.
The main goal of this paper is to provide a graph theoretical framework for solving many different problems on subdivision trees, including the above-mentioned ones, in a simple and unified manner. To this end, we prove a “structure theorem for tree-based phylogenetic networks” (Theorem 4.4) that characterizes the (possibly empty) collection of all subdivision trees of a given rooted binary phylogenetic network . Our theorem furnishes linear time (for enumeration, linear delay) algorithms for the above problems.
Furthermore, we must stress that the results and algorithms in this paper still hold true for some non-binary phylogenetic networks. Hence, we can obtain a partial answer to the question from Pons, Semple, and Steel [13] as to what the number of subdivision trees is when the network is not necessarily binary. This is an interesting byproduct that illustrates an advantage of our structural approach compared to earlier work exploiting existing tools such as 2-SAT algorithms [5] and Hall’s marriage theorem [9, 10, 13, 18].
The remainder of this paper is organized as follows. We set up basic definitions and notation in Section 2 and give a more detailed description of the above motivating problems in Section 3. In Section 4, we introduce a structure called the “maximal zig-zag trail decomposition” that is inherent in each rooted binary phylogenetic network (Lemma 4.2). Using this key lemma along with a structural version of Francis and Steel’s theorem (Lemma 4.3), we state and prove our main result as in Theorem 4.4. Section 5 is about the algorithmic implications of the theorem; we describe a series of algorithms for the problems posed in Section 3 and give a numerical example where appropriate. Section 6 contains a brief summary of relevant research with the intention to show that our theorem implies and unifies various results in the literature. Finally, we conclude the paper by suggesting two possible directions for further research in Section 7, where we provide a conjecture on a related open problem raised in  and also derive a partial solution to the aforementioned problem posed by Pons et al. .
Throughout this paper, denotes a non-empty finite set, and the terms “graph” and “network” both refer to finite, simple, acyclic digraphs (directed graphs), which we now define. A digraph is an ordered pair of a set of vertices and a set of arcs (i.e., directed edges). Given a digraph , we write and to represent the sets of vertices and arcs of , respectively. If and are finite sets, then is said to be finite. We use the notation for an arc oriented from a vertex to a vertex , and also write and to mean and , respectively. A digraph is said to be simple if holds for any and holds for any with . A simple digraph is said to be acyclic if has no cycle, namely, there is no sequence of two or more elements of such that holds for each , with indices taken .
For graphs and , is called a subgraph of if both and hold, in which case we write . A subgraph is said to be proper if we have either or . A subgraph is said to be spanning if holds. Given a graph and a subset , is said to induce the subgraph of , where denotes a set of the heads and tails of all arcs in . Besides, given a graph with and a partition of , the collection is called a decomposition of , where a partition of a set is defined to be a collection of non-empty disjoint subsets of whose union is .
For a vertex of a digraph , the in-degree and out-degree of in , denoted by and , are defined to be the cardinalities of the sets and , respectively. Given an acyclic digraph , a vertex is called a leaf of if holds.
Definition 2.1.
Given a finite set , a rooted binary phylogenetic -network is defined to be any finite simple acyclic digraph that has the following properties:
there exists a unique vertex of with and ;
is the set of leaves of ;
each vertex satisfies .
In Definition 2.1, the vertex is called the root of , which can be interpreted as the origin of all species that are signified by the leaves of . In addition, we call a tree vertex of if holds, and a reticulation vertex of otherwise. In the case when has no reticulation vertex, is called a rooted binary phylogenetic -tree.
Definition 2.2.
If a rooted binary phylogenetic -network has a spanning tree of that can be obtained from a rooted binary phylogenetic -tree by inserting zero or more vertices into each arc of , then is called a subdivision tree of .
Definition 2.3 (see  for the original definition).
A rooted binary phylogenetic -network is called a tree-based phylogenetic network (on ) if has at least one subdivision tree.
We note that Definition 2.3 is slightly more versatile than the original definition of tree-based networks in  as it still makes sense for non-binary phylogenetic networks. Such a generalization is not the focus of this paper, but we treat some non-binary phylogenetic networks in Section 7.
Definition 2.4 ([5]).
Suppose is a subgraph of a rooted binary phylogenetic -network. We say that a subset of is admissible if satisfies the following conditions:
(C0) contains all with or ;
(C1) for any with , exactly one of is in ;
(C2) for any with , at least one of is in .
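The three conditions can be checked mechanically on a small example. The sketch below is not taken from the paper: it assumes the usual reading of the conditions (C0: every arc whose head has in-degree 1 or whose tail has out-degree 1 must be chosen; C1: exactly one of two arcs sharing a head is chosen; C2: at least one of two arcs sharing a tail is chosen), and the function name `is_admissible` and the toy network `example_arcs` are hypothetical.

```python
from collections import defaultdict

def is_admissible(arcs, chosen):
    """Check the assumed conditions C0-C2 for a subset of arcs.

    arcs   : iterable of (tail, head) pairs of a binary network
    chosen : iterable of arcs, the candidate subset
    """
    arcs, chosen = set(arcs), set(chosen)
    if not chosen <= arcs:
        return False
    into, out_of = defaultdict(list), defaultdict(list)
    for (u, v) in arcs:
        out_of[u].append((u, v))
        into[v].append((u, v))
    for (u, v) in arcs:
        # C0: arcs with in-degree-1 head or out-degree-1 tail are mandatory
        forced = len(into[v]) == 1 or len(out_of[u]) == 1
        if forced and (u, v) not in chosen:
            return False
    for incoming in into.values():
        # C1: exactly one of two arcs sharing a head
        if len(incoming) == 2 and sum(a in chosen for a in incoming) != 1:
            return False
    for outgoing in out_of.values():
        # C2: at least one of two arcs sharing a tail
        if len(outgoing) == 2 and not any(a in chosen for a in outgoing):
            return False
    return True

# Hypothetical toy network: root r, tree vertices a and b,
# reticulation h, leaves x and y.
example_arcs = {("r", "a"), ("a", "b"), ("a", "h"),
                ("b", "h"), ("b", "x"), ("h", "y")}
drop_bh = example_arcs - {("b", "h")}
drop_ah = example_arcs - {("a", "h")}
print(is_admissible(example_arcs, drop_bh), is_admissible(example_arcs, drop_ah))
```

In this toy network, dropping either of the two arcs entering the reticulation `h` leaves an admissible subset, whereas keeping both violates C1 and dropping both violates C0.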
3. Motivating problems
Here, we provide the relevant background on tree-based phylogenetic networks and then describe the problems to be addressed in this paper. Intuitively, tree-based networks can be viewed as a natural extension of rooted binary phylogenetic -trees because they are merely trees with additional arcs. In [5], Francis and Steel gave an algorithmic characterization of this class of networks. More precisely, they proved that the following decision/search problem (Problem 1) can be formulated as the 2-SAT problem and, as a consequence of Theorem 3.1, described a linear time algorithm that finds a subdivision tree whenever one exists. Thus, Problem 1 can be solved in linear time.
Problem 1 ([5]).
Given a rooted binary phylogenetic -network , determine whether or not is a tree-based network on and find a subdivision tree of if there exists any.
Theorem 3.1 ([5]).
A rooted binary phylogenetic -network is tree-based if and only if there exists an admissible subset of . In this case, induces a subdivision tree of . Moreover, there exists a bijection between the families of admissible subsets of and of arc-sets of subdivision trees of .
Then, the question arises: what is the time complexity of the following counting problem? Francis and Steel [5] noted that counting the subdivision trees of a tree-based phylogenetic network might also be #P-complete, as counting the satisfying solutions of 2-SAT is known to be #P-complete [16]. Contrary to this conjecture, Jetten [9] and Pons et al. [13] derived equivalent formulae for this number that indicate the existence of a polynomial time algorithm. However, a more detailed time complexity analysis has not been provided to date.
Problem 2 ().
Given a rooted binary phylogenetic -network , count the number of subdivision trees of .
In addition to the above two, let us describe some associated problems that are interesting when holds (i.e., is tree-based). The first question is whether there exists an efficient algorithm for the following enumeration problem.
Problem 3.
Given a tree-based phylogenetic network on , list all subdivision trees of .
In general, the number of solutions of an enumeration problem can be exponential in the size of the input or even infinite. In the usual context of algorithm theory, therefore, the time complexity of listing combinatorial structures is analyzed in terms of both input size and output size. In particular, polynomial delay algorithms [11], which generate all solutions one after another such that the time between the output of any two consecutive solutions is bounded by a polynomial in the input size, are considered one of the most efficient classes of enumeration algorithms [6]. Such algorithms are indeed fast, as their running time is linear in the size of the output. Hence, even though the number of solutions of Problem 3 is exponential in the size of the network, if a polynomial delay algorithm exists for it, then the problem turns out to be tractable, leading to statistical applications such as generating a subdivision tree uniformly at random.
The next question is whether the following optimization problem can be solved in polynomial time in the size of the network. Note that we can convert Problem 4 into a minimization problem by changing the sign, or use an objective function in product form by taking the exponential. Applications of this problem include the setting where, given a phylogenetic network and the probabilities of its arcs, we wish to estimate a subdivision tree that maximizes the likelihood or log-likelihood.
Problem 4.
Given a tree-based phylogenetic network on and an associated weighting function , find a subdivision tree of to maximize the value of the objective function .
4. Structure theorem for tree-based phylogenetic networks
For a rooted binary phylogenetic -network , we define a zig-zag trail in as a connected subgraph of with such that there exists a permutation of where either or holds for each . Any zig-zag trail in can be expressed by an alternating sequence of (not necessarily distinct) vertices and distinct arcs, such as ; however, we will more concisely represent above by writing or in reverse order. The notation may be also used when no confusion arises.
A zig-zag trail in is said to be maximal if contains no zig-zag trail such that is a proper subgraph of . Any maximal zig-zag trail in falls into one of four types, which are defined as follows (see also Figure 1). A maximal zig-zag trail in with even is called a crown if can be written in the form ; otherwise, it is called a fence. Furthermore, a fence with odd is called an N-fence, which can be expressed as . Also, a fence with even is called a W-fence if it can be written as while it is called an M-fence if it can be written as . For any fence , its vertices and on both ends are called the endpoints of .
The following lemma is essential to state our subsequent structural results as it describes a structure inherent in each rooted binary phylogenetic -network.
Lemma 4.2.
For any rooted binary phylogenetic -network , there exists a unique decomposition of such that each is a maximal zig-zag trail in .
The proof is divided into two parts. We first claim that holds for any with . Indeed, any two zig-zag trails and in have a common arc if and only if either or holds because would contain a vertex of in-degree or out-degree otherwise. Then, the claim follows from the maximality of and . Next, we will prove that for any , there exists a unique element of with . For any , there exists an obvious zig-zag trail in , that is, . Then, because is finite, there exists a maximal one with . By using the first claim, we can conclude that such is uniquely determined by . This completes the proof. ∎
Using the idea of maximal zig-zag trail decomposition, we can state a structural version of Theorem 3.1 as follows.
Lemma 4.3.
Let be a rooted binary phylogenetic -network and be the maximal zig-zag trail decomposition of . Then, is an admissible subset of if and only if is an admissible subset of for each .
Our goal is to prove that satisfies the conditions C0, C1, and C2 in Definition 2.4 if and only if for each , the following C0′, C1′, and C2′ hold:
(C0′) contains all with or ;
(C1′) for any with , exactly one of is in ;
(C2′) for any with , at least one of is in .
If satisfies (or ), then there exists a unique element of with and (or ) by Lemma 4.2. The converse also holds as would not be maximal otherwise. Lemma 4.2 also implies that is partitioned into (note that no element of is empty). Thus, we can assert that satisfies C0 if and only if C0′ holds for each . By similar reasoning, we can deduce that satisfies if and only if there exists a unique element of such that has the same property. Recalling that is a partition of , we have for any and any with . Hence, satisfies C1 if and only if C1′ holds for each . The same argument yields the desired conclusion regarding C2 and C2′. This completes the proof. ∎
From now on, we consider an ordered set of maximal zig-zag trails in so that we can identify a subgraph of with a direct product . In addition, for each , we represent the set using a sequence of the elements of that form the zig-zag trail in this order, where . This allows us to encode an arbitrary subset of using a 0-1 sequence of length . For example, given , we can specify the subset by the sequence . Using this notation, for each , we define a family of subsets of as follows.
Note that the above sequence representation of the subsets in does not depend on the ordering of the arcs of by virtue of the symmetric structure. For example, when is an N-fence, the sequence and its reverse ordering are identical.
Theorem 4.4 (Structure theorem for tree-based phylogenetic networks).
Let be a rooted binary phylogenetic -network and be a decomposition of where each is a maximal zig-zag trail in . Then, is a tree-based phylogenetic network on if and only if no element is a W-fence. In this case, the collection of subdivision trees of is characterized by
where is defined in .
We first recall Theorem 3.1. By Lemma 4.2 and Lemma 4.3, one can produce every admissible subset of by choosing each independently. In what follows, we consider a maximal zig-zag trail with . We enumerate all 0-1 sequences corresponding to the admissible subsets of in each of the following four cases (see also Figure 4).
When is a crown , the condition C0 does not apply. Repeated application of the conditions C1 and C2 derives the only solution from . Similarly, implies . This proves that a family of all admissible subsets of is given by .
When is an N-fence , one of its endpoints is a reticulation vertex of . Then, the condition C0 gives , which implies as in the previous case. This proves that is the only admissible subset of .
When is a W-fence , both and are reticulation vertices of , so the constraint C0 gives and again. Similarly to the above, implies while implies according to C1 and C2. However, this means that no admissible subset of exists because is even.
When is an M-fence , we have and again but the other values are left undetermined. We claim that a family of all admissible subsets of with is given by . The proof is by induction on the length (recall that is even). The assertion is trivial for . We consider the two cases according to the value of in the sequence of length . When holds, is the only admissible subset of having this form. When holds, this only implies . By the induction hypothesis, the family of admissible subsets having this form consists of the sequences with , , and . This proves the claim.
This completes the proof. ∎
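The four cases of the proof can be cross-checked by brute force on short trails. The sketch below is illustrative only and rests on an assumed encoding: a trail of m arcs is a 0-1 sequence, consecutive arcs alternately share a tail (C2) or a head (C1), both end arcs of a fence are forced by C0, and the closed forms in `closed_form` are one reading of the families described in the proof (two alternating sequences for a crown, one for an N-fence, none for a W-fence, and m/2 for an M-fence). The names `brute_force`, `closed_form`, and `CASES` are hypothetical.

```python
from itertools import product

def brute_force(kind, m):
    """All 0-1 sequences of length m satisfying C0-C2 on one trail."""
    # Assumed layout: crowns and M-fences start with a shared tail,
    # N-fences and W-fences start with a shared head.
    first_shares_head = kind in ("N-fence", "W-fence")
    pairs = [(i, i + 1) for i in range(m - 1)]
    if kind == "crown":
        pairs.append((m - 1, 0))  # a crown closes up into a cycle
    found = []
    for s in product((0, 1), repeat=m):
        # C0 forces both end arcs of a fence; a crown has no end arcs.
        ok = kind == "crown" or (s[0] == 1 and s[-1] == 1)
        for k, (i, j) in enumerate(pairs):
            shares_head = (k % 2 == 0) == first_shares_head
            if shares_head:
                ok = ok and s[i] + s[j] == 1   # C1: exactly one
            else:
                ok = ok and s[i] + s[j] >= 1   # C2: at least one
        if ok:
            found.append(s)
    return found

def closed_form(kind, m):
    """Admissible sequences as read off from the case analysis."""
    if kind == "crown":
        return [(1, 0) * (m // 2), (0, 1) * (m // 2)]
    if kind == "N-fence":
        return [(1,) + (0, 1) * ((m - 1) // 2)]
    if kind == "W-fence":
        return []
    half = (m - 2) // 2  # M-fence: 1 (01)^p (10)^q 1 with p + q = half
    return [(1,) + (0, 1) * p + (1, 0) * (half - p) + (1,)
            for p in range(half + 1)]

CASES = {"crown": (4, 6, 8), "N-fence": (1, 3, 5, 7),
         "W-fence": (2, 4, 6, 8), "M-fence": (2, 4, 6, 8)}
agree = all(sorted(brute_force(k, m)) == sorted(closed_form(k, m))
            for k, lengths in CASES.items() for m in lengths)
print(agree)
```

The exhaustive search is exponential in m and is meant only as a sanity check on small cases, not as part of any algorithm in the paper.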
5. Algorithmic implications
From Theorem 4.4, we can derive fast algorithms for the problems posed in Section 3. The next proposition is relevant to each problem because all algorithms presented here start by decomposing the input network into maximal zig-zag trails.
Proposition 5.1.
For any rooted binary phylogenetic -network , one can obtain the maximal zig-zag trail decomposition of in time.
As described in Algorithm 1, the above decomposition can be obtained by visiting each arc of exactly once, which requires time. This completes the proof. ∎
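Algorithm 1 itself is not reproduced here, so the following is only a minimal sketch of one way to compute such a decomposition, not the author's pseudocode. It assumes a binary network given as a list of (tail, head) pairs, so that every arc has at most one other arc with the same tail and at most one with the same head. The function name `zigzag_decomposition` and the toy network `example_arcs` are hypothetical.

```python
from collections import defaultdict

def zigzag_decomposition(arcs):
    """Split the arcs into maximal zig-zag trails.

    Returns a list of (kind, trail), where trail lists the arcs in
    zig-zag order and kind is 'crown', 'N-fence', 'W-fence' or 'M-fence'.
    """
    arcs = list(arcs)
    by_tail, by_head = defaultdict(list), defaultdict(list)
    for a in arcs:
        by_tail[a[0]].append(a)
        by_head[a[1]].append(a)

    def partner(arc, side):
        # side 0: the other arc with the same tail; side 1: same head
        group = (by_tail if side == 0 else by_head)[arc[side]]
        others = [b for b in group if b != arc]
        return others[0] if others else None

    def walk(start, side):
        # Follow the trail from start, alternating shared tails and heads.
        path, arc = [], start
        while True:
            nxt = partner(arc, side)
            if nxt is None:
                return path, False      # reached an endpoint of a fence
            if nxt == start:
                return path, True       # came back around: a crown
            path.append(nxt)
            arc, side = nxt, 1 - side

    seen, result = set(), []
    for a in arcs:
        if a in seen:
            continue
        forward, closed = walk(a, 0)
        if closed:
            kind, trail = "crown", [a] + forward
        else:
            backward, _ = walk(a, 1)
            trail = backward[::-1] + [a] + forward
            if len(trail) % 2 == 1:
                kind = "N-fence"
            elif trail[0][0] == trail[1][0]:
                kind = "M-fence"        # end arcs hang down from shared tails
            else:
                kind = "W-fence"        # end arcs point into shared heads
        seen.update(trail)
        result.append((kind, trail))
    return result

# Hypothetical toy network with one reticulation h.
example_arcs = [("r", "a"), ("a", "b"), ("a", "h"),
                ("b", "h"), ("b", "x"), ("h", "y")]
decomposition = zigzag_decomposition(example_arcs)
print([(kind, len(trail)) for kind, trail in decomposition])
```

Each arc is appended to exactly one trail and each lookup touches a list of at most two arcs, so under the binary assumption the running time is linear in the number of arcs, in line with the proposition.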
Let be a rooted binary phylogenetic -network that has subdivision trees and be the maximal zig-zag trail decomposition of . Then, holds, where
Now, we are in a position to describe a linear time algorithm for solving Problem 1 and Problem 2 simultaneously. Given a rooted binary phylogenetic -network , the algorithm counts the number of subdivision trees of as follows: 1) compute the maximal zig-zag trail decomposition of ; 2) determine for each according to the equation (3); 3) return . In this procedure, the most expensive step is the first one, which requires time by Proposition 5.1.
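Steps 2 and 3 of this procedure amount to multiplying one factor per trail. The sketch below assumes the factors implied by the case analysis in the proof of Theorem 4.4 (0 for a W-fence, 1 for an N-fence, 2 for a crown, and half the number of arcs for an M-fence) and takes as input a list of (kind, number of arcs) pairs. The function name `count_subdivision_trees` and this input format are hypothetical.

```python
from math import prod

def count_subdivision_trees(trails):
    """Multiply the per-trail numbers of admissible subsets.

    trails : list of (kind, m) pairs, one per maximal zig-zag trail,
             where m is the number of arcs of the trail.
    """
    def factor(kind, m):
        if kind == "W-fence":
            return 0          # no admissible subset: not tree-based
        if kind == "crown":
            return 2          # the two alternating choices
        if kind == "M-fence":
            return m // 2     # assumed count of sequences 1(01)^p(10)^q1
        return 1              # N-fence: everything is forced
    return prod(factor(kind, m) for kind, m in trails)

# Hypothetical inputs: a network with two single-arc N-fences and one
# M-fence of four arcs, and the same network with an extra W-fence.
tree_based = [("N-fence", 1), ("M-fence", 4), ("N-fence", 1)]
not_tree_based = tree_based + [("W-fence", 2)]
print(count_subdivision_trees(tree_based), count_subdivision_trees(not_tree_based))
```

A result of zero answers the decision part of Problem 1 in the negative, and any positive result is the answer to Problem 2, which is why the two problems are solved together.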
As we now demonstrate, the number can give insights into the complexity of a tree-based phylogenetic network on . Given a tree-based network shown in Figure 5, our algorithm starts by decomposing into 21 maximal N-fences consisting of a single arc and 7 maximal M-fences, and so returns . Comparing this output with the trivial upper bound , where denotes the number of reticulation vertices of , we can see that it is meaningful to compute the exact value of . Although the number may seem huge, it is smaller than the number of rooted binary phylogenetic -trees that is given by . In other words, does not have adequate complexity in order to cover all rooted binary phylogenetic -trees. Thus, the number can be used as a quantitative measure for the complexity of , which may have implications on model selection in evolutionary data analysis.
The next corollary states that the problem of generating all solutions (Problem 3) can also be easily solved in the same manner.
For any rooted binary phylogenetic -network , the number of subdivision trees of can be counted in time. Moreover, when holds, it is possible to list all subdivision trees of in linear delay.
It remains to prove the second statement. We may assume that holds. Consider the algorithm that first decomposes into maximal zig-zag trails and then generates all elements in the solution set . By Theorem 4.4, the elements of can be generated in time if is an M-fence, and in time otherwise. Then, each of the following steps requires time: find a solution; check whether there exists another solution that has not been found yet; find the next solution if there exists any. This completes the proof. ∎
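The enumeration idea can be illustrated with a lazy generator that combines one admissible choice per trail. This is a simplified sketch rather than the linear delay procedure analyzed in the proof: it precomputes all choices for every trail, which costs more than linear time for long M-fences, and it assumes the per-trail sequences read off from Theorem 4.4. The names `trail_choices` and `enumerate_admissible`, and the toy input, are hypothetical.

```python
from itertools import product

def trail_choices(kind, m):
    """Admissible 0-1 sequences of one trail (assumed closed forms)."""
    if kind == "crown":
        return [(1, 0) * (m // 2), (0, 1) * (m // 2)]
    if kind == "N-fence":
        return [(1,) + (0, 1) * ((m - 1) // 2)]
    if kind == "W-fence":
        return []
    half = (m - 2) // 2
    return [(1,) + (0, 1) * p + (1, 0) * (half - p) + (1,)
            for p in range(half + 1)]

def enumerate_admissible(trails):
    """Yield the arc set of every subdivision tree, one at a time.

    trails : list of (kind, arcs), with arcs listed in zig-zag order.
    """
    per_trail = []
    for kind, arcs in trails:
        options = [[a for a, bit in zip(arcs, seq) if bit]
                   for seq in trail_choices(kind, len(arcs))]
        per_trail.append(options)
    # One independent choice per trail; product() yields lazily.
    for combo in product(*per_trail):
        yield [a for part in combo for a in part]

# Hypothetical toy decomposition (two N-fences and one M-fence).
toy = [("N-fence", [("r", "a")]),
       ("M-fence", [("a", "b"), ("a", "h"), ("b", "h"), ("b", "x")]),
       ("N-fence", [("h", "y")])]
solutions = list(enumerate_admissible(toy))
for arc_set in solutions:
    print(arc_set)
```

Because the generator is lazy, a caller can stop after any prescribed number of solutions instead of listing them all.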
Recalling that the running time of polynomial delay algorithms is linear in the size of the output as mentioned in Section 3, we have the following corollary.
For any tree-based phylogenetic network on that has subdivision trees, it is possible to generate subdivision trees in time.
As stated in the next corollary, the optimization problem (Problem 4) can be solved in linear time in the size of . Note that we can turn it into a minimization problem or define an objective function in the product form as mentioned in Section 3.
For any tree-based phylogenetic network on and any associated weighting function , a subdivision tree of that maximizes the value of the objective function can be found in time.
Assume that we have computed the maximal zig-zag trail decomposition of . By virtue of the decomposable nature of , we have , where for each . Then, by focusing on the maximization of each term , we can obtain a global maximum. Finding a local maximum requires time for each . Overall, time suffices. This completes the proof. ∎
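The trail-by-trail maximization can be sketched as follows, assuming an additive objective (the sum of the weights of the chosen arcs) and the per-trail admissible sequences read off from Theorem 4.4. For an M-fence, moving from one candidate to the next flips a single pair from (1,0) to (0,1), so a running total suffices and each trail is handled in time linear in its length. The names `best_choice` and `maximize` are hypothetical.

```python
def best_choice(kind, w):
    """Best admissible sequence on one trail for arc weights w.

    Returns (value, sequence), or None if the trail is a W-fence.
    """
    m = len(w)
    if kind == "W-fence":
        return None
    if kind == "N-fence":
        return sum(w[0::2]), (1,) + (0, 1) * ((m - 1) // 2)
    if kind == "crown":
        even, odd = sum(w[0::2]), sum(w[1::2])
        if even >= odd:
            return even, (1, 0) * (m // 2)
        return odd, (0, 1) * (m // 2)
    # M-fence: candidates are 1 (01)^p (10)^q 1 with p + q = half.
    half = (m - 2) // 2
    total = w[0] + w[-1] + sum(w[1:m - 1:2])      # candidate with p = 0
    best_total, best_p = total, 0
    for p in range(1, half + 1):
        total += w[2 * p] - w[2 * p - 1]          # flip pair p to (0, 1)
        if total > best_total:
            best_total, best_p = total, p
    seq = (1,) + (0, 1) * best_p + (1, 0) * (half - best_p) + (1,)
    return best_total, seq

def maximize(trails):
    """Sum of per-trail optima; trails is a list of (kind, weights)."""
    picks = [best_choice(kind, w) for kind, w in trails]
    if any(p is None for p in picks):
        return None                               # not tree-based
    return sum(v for v, _ in picks), [s for _, s in picks]

value, sequences = maximize([("N-fence", [2]),
                             ("M-fence", [1, 0, 9, 0, 9, 1]),
                             ("crown", [3, 1, 3, 1])])
print(value, sequences)
```

A minimization variant follows by negating the weights, and a product-form objective with positive weights can be handled by applying the same routine to the logarithms of the weights.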
6. Connection to some known results
Before showing that our structure theorem implies and unifies various results in the literature, we briefly review relevant studies on tree-based phylogenetic networks. After Francis and Steel [5] provided a linear time algorithm for the decision/search problem by focusing on its connection to the 2-SAT problem, Problem 1 and Problem 2 have been studied by several different authors. For example, Zhang [18] proposed yet another linear time algorithm for the above decision problem by characterizing forbidden subgraphs of tree-based networks via an application of Hall’s marriage theorem. Independently of [18], Jetten and van Iersel [10] obtained the same graph theoretical characterization and further considered non-binary phylogenetic networks. Jetten [9] also derived a formula for the number of subdivision trees of a tree-based phylogenetic network. More recently, Pons et al. [13] re-derived an equivalent formula and pointed out that their results mean that Problem 2 is solvable in polynomial time.
We can easily derive the above results from Theorem 4.4 as it provides a characterization of the collection of all subdivision trees of . Our structural approach makes it easy to see that maximal W-fences are the forbidden subgraphs of tree-based phylogenetic networks (cf. [10, 18]) and that the number of crowns in and the number and lengths of maximal M-fences in are the factors that contribute to when is tree-based (cf. [9, 13]). Thus, we can obtain the same conclusions in a straightforward way without needing existing results from matching theory (cf. [9, 10, 13, 18]). Also, as we have seen earlier, the theorem yields a linear time algorithm for solving Problem 1 and Problem 2 simultaneously, revealing a precise bound on the time complexity of Problem 2 that cannot be seen through the application of the 2-SAT problem (cf. [5]).
7. Further research directions
In this paper, we have posed various important computational problems about tree-based phylogenetic networks and provided efficient algorithms for the individual problems in a unified manner as a consequence of the structure theorem for tree-based phylogenetic networks. Our work not only presents an elegant approach for proving various results in the literature but also enables new statistical applications such as measuring the complexity of tree-based networks, uniform sampling of subdivision trees, and finding the maximum likelihood subdivision tree. We end the paper by suggesting two research directions that would be interesting to pursue.
7.1. Time complexity of counting base trees
Given a subdivision tree of a tree-based phylogenetic network on , such a rooted binary phylogenetic -tree as described in Definition 2.2 is called a base tree of . It is still unknown whether the following problem can be solved in polynomial time.
Problem 5 ().
Given a tree-based phylogenetic network on , count the number of base trees of .
The tree-based network shown in Figure 6 strikingly demonstrates the difference between Problem 2 and Problem 5. Given this network as input , our counting algorithm returns although holds as it virtually contains only one phylogenetic tree. We conjecture that Problem 5 is #P-complete in contrast to Problem 2 being solvable in linear time. Even if so, however, it would be meaningful to consider the relationship between and towards the development of useful criteria for analyzing the complexity of phylogenetic networks.
7.2. Generalization to non-binary phylogenetic networks
Recently, several attempts have been made to extend the results on binary tree-based phylogenetic networks to non-binary ones (e.g., [4, 10, 13]). There are many challenges in this direction; for example, Pons et al. [13] posed the natural problem of determining the number of subdivision trees if the network is not necessarily binary.
Therefore, it would be helpful to comment on the extendability of the present results to some non-binary phylogenetic networks and to provide a partial answer to the above question raised in [13]. First, as mentioned in Section 2, our definition of tree-based phylogenetic networks (Definition 2.3) does not require the network to be binary (cf. [5]). Also, by virtue of our decomposition-based approach, we can readily see that the results and algorithms in this paper still hold true for any rooted phylogenetic -network such that each vertex satisfies both and . Hence, the previous results referred to in Section 6 including the formulae for are still valid if contains such vertices as shown in Figure 7.
References
- [1] M. Anaya, O. Anipchenko-Ulaj, A. Ashfaq, J. Chiu, M. Kaiser, M. S. Ohsawa, M. Owen, E. Pavlechko, K. St. John, S. Suleria, K. Thompson, and C. Yap, On determining if tree-based networks contain fixed trees, Bulletin of Mathematical Biology 78 (2016), no. 5, 961–969.
- [2] D. Bryant and V. Moulton, Neighbor-net: An agglomerative method for the construction of phylogenetic networks, Molecular Biology and Evolution 21 (2004), no. 2, 255–265.
- [3] D. H. Huson and D. Bryant, SplitsTree4 V4.14.6 (2017-09-26), http://www.splitstree.org/.
- [4] M. Fischer, M. Galla, L. Herbst, Y. Long, and K. Wicke, Non-binary treebased unrooted phylogenetic networks and their relations to binary and rooted ones, arXiv:1810.06853 (2018).
- [5] A. R. Francis and M. Steel, Which phylogenetic networks are merely trees with additional arcs?, Systematic Biology 64 (2015), no. 5, 768–777.
- [6] L. A. Goldberg, Efficient algorithms for listing combinatorial structures, vol. 5, Cambridge University Press, 2009.
- [7] M. Hayamizu, On the existence of infinitely many universal tree-based networks, Journal of Theoretical Biology 396 (2016), 204–206.
- [8] D. H. Huson, R. Rupp, and C. Scornavacca, Phylogenetic networks: Concepts, algorithms and applications, Cambridge University Press, 2010.
- [9] L. Jetten, Characterising tree-based phylogenetic networks, bachelor thesis, November 2015, TU Delft repository uuid:fda2636d-0ed5-4dd2-bacf-8abbbad8994e.
- [10] L. Jetten and L. van Iersel, Nonbinary tree-based phylogenetic networks, IEEE/ACM Transactions on Computational Biology and Bioinformatics 15 (2018), no. 1, 205–217.
- [11] D. S. Johnson, M. Yannakakis, and C. H. Papadimitriou, On generating all maximal independent sets, Information Processing Letters 27 (1988), no. 3, 119–123.
- [12] S. Linz, K. St. John, and C. Semple, Counting trees in a phylogenetic network is #P-complete, SIAM Journal on Computing 42 (2013), no. 4, 1768–1776.
- [13] J. C. Pons, C. Semple, and M. Steel, Tree-based networks: characterisations, metrics, and support trees, Journal of Mathematical Biology 78 (2019), no. 4, 899–918.
- [14] B. Schröder, Ordered sets: An introduction with connections from combinatorics to topology, 2nd ed., Birkhäuser, 2016.
- [15] M. Steel, Phylogeny: Discrete and random processes in evolution, SIAM, 2016.
- [16] L. G. Valiant, The complexity of enumeration and reliability problems, SIAM Journal on Computing 8 (1979), no. 3, 410–421.
- [17] L. van Iersel, C. Semple, and M. Steel, Locating a tree in a phylogenetic network, Information Processing Letters 110 (2010), no. 23, 1037–1043.
- [18] L. Zhang, On tree-based phylogenetic networks, Journal of Computational Biology 23 (2016), no. 7, 553–565.
This study was supported by JST PRESTO Grant Number JPMJPR16EB and was conducted independently from [10, 13, 18]. Most results in this paper were announced in the author’s talk entitled “A linear time algorithm for counting the number of support trees in a tree-based network” on February 13th, 2018 at the Portobello 2018 Conference (The Interface of Mathematics and Biology, The 22nd Annual New Zealand Phylogenomics Meeting). The author thanks Kazuhisa Makino for improving an earlier version of this paper by suggesting Problem 4 and by providing useful comments on Subsection 7.2. The author is also grateful to Andrew Francis, Leo van Iersel, and Louxin Zhang for providing information on related studies and to Mike Steel for some helpful comments.