1. Introduction
Although phylogenetic networks are widely used to describe non-treelike evolution or to represent conflicts in data or uncertainty in evolutionary histories (e.g., [2, 3, 7, 13]), phylogenetic trees are still regarded as a fundamental model of evolution for their ultimate simplicity. Therefore, it is an essential undertaking to try to recognize a tree within a phylogenetic network (e.g., [10, 15]), which gives a possible explanation for Francis and Steel’s philosophy behind their definition of “tree-based” phylogenetic networks [4] (see also [13]).
Intuitively, tree-based phylogenetic networks can be seen as a natural generalization of phylogenetic trees since they are merely trees with additional arcs [4]. In the last few years, tree-based networks have attracted much attention from theoretical biologists, and their mathematical and computational aspects have been actively studied (e.g., [1, 6, 8, 11, 16]). In this context, although we will provide formal definitions later, the notion of “subdivision trees” plays an essential role because tree-based networks can be defined as those having at least one subdivision tree.
In the theory of computational complexity, the most fundamental type of question concerns the time complexity of a decision/search problem, such as the problem of determining whether a given phylogenetic network is tree-based and, if so, finding one of its subdivision trees. In [4], Francis and Steel proved that this problem can be formulated as the 2-satisfiability (2-SAT) problem and provided a linear-time algorithm for solving it. Then, in view of the fact that counting the satisfying solutions of 2-SAT is #P-complete [14], they conjectured that the problem of counting the number of subdivision trees might also be hard [4]. Besides this counting problem, there are several natural problems whose time complexity is still unknown, which are summarized below.
Counting problem: given a rooted binary phylogenetic network, count the number of its subdivision trees. The time complexity of this problem was left open in [4]. This problem can be relevant to quantifying the complexity of phylogenetic networks, as networks having many spanning trees tend to be more complex than those with only a few. Although some results were obtained in [8, 11], no polynomial-time algorithm has been provided so far. It is known that a similar counting problem is #P-complete [10].

Enumeration problem: given a rooted binary phylogenetic network, list all of its subdivision trees. Although the time complexity of this problem is exponential in the size of the input [4], this does not rule out the existence of a fast algorithm. In the usual context of algorithm theory, the complexity of enumeration problems is evaluated in terms of the size of both input and output. It is meaningful to analyze this problem more carefully because listing a designated number of solutions, rather than all, makes practical sense in many applications.

Optimization problem: given a rooted binary phylogenetic network and a weighting of its arcs, find a subdivision tree that maximizes (or minimizes) a prescribed objective function. From a statistical perspective, this problem can be interpreted as modeling the situation where a phylogenetic network and the probability of the arcs are given and we wish to estimate the best tree within the network to maximize the likelihood or log-likelihood.
In this paper, we will show that the above problems can be solved in linear time (for enumeration, in linear delay) as a consequence of a structural characterization of tree-based phylogenetic networks (Theorem 4.4). Our structural result has implications for problems other than the above; for instance, it implies that one can quickly list a designated number of solutions in non-increasing order according to the value of the objective function.
The remainder of this paper is organized as follows. We set up basic definitions and notation in Section 2, and give a more formal description of the above-mentioned problems in Section 3. In Section 4, we begin by introducing some concepts essential to our results and then prove a structure theorem for tree-based phylogenetic networks. With this structural result in hand, in Section 5, we provide efficient algorithms for solving the above series of problems and also prove several known results on tree-based networks from a unifying point of view. Section 6 indicates possible directions of further studies; in particular, we suggest a generalization of the present work by mentioning that the results in this paper still hold for a certain kind of non-binary networks. The Appendix contains a numerical example to demonstrate how our algorithm solves the above counting problem and how this can give insights into the complexity of phylogenetic networks.
2. Preliminaries
Throughout this paper, denotes a nonempty finite set, and the terms “graph” and “network” all refer to finite, simple, acyclic digraphs (directed graphs), which we now define. A digraph is an ordered pair of a set of vertices and a set of arcs (i.e., directed edges). Given a digraph , we write and to represent the sets of vertices and arcs of , respectively. If and are finite sets, then is said to be finite. We use the notation for an arc oriented from a vertex to a vertex , and also write and to mean and , respectively. A digraph is said to be simple if holds for any and holds for any with . A simple digraph is said to be acyclic if has no cycle, namely, there is no sequence of three or more elements of such that holds for each , with indices taken .
For graphs and , is called a subgraph of if both and hold, in which case we write . A subgraph is said to be proper if we have either or . A subgraph is said to be spanning if holds. Given a graph and a subset , is said to induce the subgraph of , where denotes a set of the heads and tails of all arcs in . Besides, given a graph with and a partition of , the collection is called a decomposition of , where a partition of a set is defined to be a collection of nonempty disjoint subsets of whose union is .
For a vertex of a digraph , the indegree and outdegree of in , denoted by and , are defined to be the cardinalities of the sets and , respectively. Given an acyclic digraph , a vertex is called a leaf of if holds.
Definition 2.1.
Given a finite set , a rooted binary phylogenetic network is defined to be any finite simple acyclic digraph that has the following properties:

there exists a unique vertex of with and ;

is the set of leaves of ;

each vertex satisfies .
In Definition 2.1, the vertex is called the root of , which can be interpreted as the origin of all species that are signified by the leaves of . In addition, we call a tree vertex of if holds, and a reticulation vertex of otherwise. In the case when has no reticulation vertex, is called a rooted binary phylogenetic tree.
Definition 2.2 ([4]).
For any rooted binary phylogenetic network , a subdivision tree of is defined to be a spanning tree of that can be obtained from a rooted binary phylogenetic tree by inserting zero or more vertices into each arc of .
Definition 2.3 ([4]).
Suppose is a subgraph of a rooted binary phylogenetic network. We say that a subset of is admissible if satisfies the following conditions:
 C0:

contains all with or .
 C1:

for any with , exactly one of is in .
 C2:

for any with , at least one of is in .
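As an illustration of Definition 2.3, the following minimal Python sketch checks the three conditions directly. The symbols of C0, C1, and C2 are not legible above, so the reading used here is an assumption inferred from the proof of Theorem 4.4: C0 forces every arc whose tail has outdegree 1 or whose head has indegree 1 (degrees taken in the whole network), C1 concerns two arcs with a common head, and C2 concerns two arcs with a common tail. The function name `is_admissible` and the toy network `ARCS` are hypothetical.

```python
from collections import defaultdict

def is_admissible(network_arcs, subgraph_arcs, chosen):
    """Return True if `chosen` satisfies C0, C1 and C2 within `subgraph_arcs`.

    Arcs are (tail, head) tuples. Degrees are measured in the whole network,
    which is an assumed reading of C0 (the symbols are missing in the text).
    """
    chosen = set(chosen)
    if not chosen <= set(subgraph_arcs):
        return False
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in network_arcs:
        outdeg[u] += 1
        indeg[v] += 1
    by_head, by_tail = defaultdict(list), defaultdict(list)
    for u, v in subgraph_arcs:
        # C0: an arc is forced if its tail has outdegree 1 or its head has indegree 1.
        if (outdeg[u] == 1 or indeg[v] == 1) and (u, v) not in chosen:
            return False
        by_tail[u].append((u, v))
        by_head[v].append((u, v))
    # C1: of two arcs sharing a head, exactly one is chosen.
    for pair in by_head.values():
        if len(pair) == 2 and sum(a in chosen for a in pair) != 1:
            return False
    # C2: of two arcs sharing a tail, at least one is chosen.
    for pair in by_tail.values():
        if len(pair) == 2 and not any(a in chosen for a in pair):
            return False
    return True

# Hypothetical toy network: root r, tree vertices a, b, c, reticulation d, leaves x, y, z.
ARCS = [("r", "a"), ("a", "b"), ("a", "c"), ("b", "d"),
        ("c", "d"), ("b", "x"), ("c", "y"), ("d", "z")]
```

In this toy network, dropping exactly one of the two arcs entering the reticulation vertex `d` gives an admissible subset, whereas keeping both or dropping both violates C1.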
3. Motivating problems
Here, we provide the relevant background on “tree-based” phylogenetic networks in order to describe the problems to be addressed in this paper. We say that a rooted binary phylogenetic network is tree-based if it has at least one subdivision tree (see [4] for the original definition). Intuitively, tree-based networks can be viewed as a natural extension of rooted binary phylogenetic trees because they are merely trees with additional arcs. In [4], Francis and Steel gave an algorithmic characterization of this class of networks. More precisely, they proved that the following decision/search problem can be formulated as the 2-SAT problem and, as a consequence of Theorem 3.1, described a linear-time algorithm for finding a subdivision tree if one exists. Thus, they showed that the following decision/search problem can be solved in linear time.
Problem 1 ([4]).
Given a rooted binary phylogenetic network , determine whether or not is a treebased network on and find a subdivision tree of if there exists any.
Theorem 3.1 ([4]).
Given a rooted binary phylogenetic network , any subdivision tree of is a subgraph of induced by an admissible subset of . Moreover, there exists a bijection between the families of admissible subsets of and of arcsets of subdivision trees of .
As counting the number of satisfying solutions of 2-SAT is known to be #P-complete [14], a natural question arises as to whether the following counting problem can be solved in polynomial time, which was left open in [4].
Problem 2 ([4]).
Given a rooted binary phylogenetic network , count the number of subdivision trees of .
We will now describe some analogs to the above. The first question is whether there exists an efficient algorithm for the following enumeration problem.
Problem 3.
Given a rooted binary phylogenetic network , list all subdivision trees of .
In general, the number of solutions of enumeration problems can be exponential in the size of the input or even infinite. In the usual context of algorithm theory, therefore, the time complexity of listing combinatorial structures has been analyzed in terms of both input size and output size. In particular, polynomial-delay algorithms [9], which generate all solutions one after another such that the time between the output of any two consecutive solutions is bounded by a polynomial function in the input size, are considered one of the most efficient classes of enumeration algorithms [5]. Those algorithms are fast indeed as their running time is linear with respect to the size of the output. Hence, even though the number of solutions of Problem 3 can be exponential in the size of the input network [4], we can say that the problem is tractable if there exists a polynomial-delay algorithm for solving Problem 3.
The next question is whether the following optimization problem can be solved in polynomial time in the size of . Note that we can convert Problem 4 into a minimization problem by changing the sign or use the objective function by taking the exponential. Applications of this problem include the setting where, given a phylogenetic network and the probability of the arcs of , we wish to estimate a subdivision tree of to maximize the likelihood or loglikelihood .
Problem 4.
Given a rooted binary phylogenetic network and an associated weighting function , find a subdivision tree of to maximize the value of the objective function .
The above fundamental problems give rise to interesting variations; for example, combining Problem 3 with Problem 4 formulates the following problem. The question is whether we can quickly generate a designated number of (sub)optimal solutions.
Problem 5.
Given a rooted binary phylogenetic network , an associated weighting function , and , list subdivision trees in nonincreasing order according to the value of the objective function .
4. Structure theorem for treebased phylogenetic networks
In order to establish a framework to solve the abovementioned problems, we provide a structural characterization of treebased phylogenetic networks. For a rooted binary phylogenetic network , we define a zigzag trail in as a connected subgraph of with such that there exists a permutation of where either or holds for each . Any zigzag trail in can be expressed by an alternating sequence of (not necessarily distinct) vertices and distinct arcs, such as ; however, we will more concisely represent above by writing or in reverse order. The notation may be also used when no confusion arises.
A zigzag trail in is said to be maximal if contains no zigzag trail such that is a proper subgraph of . Any maximal zigzag trail in falls into one of the four types, which are defined as follows (see also Figure 1). A maximal zigzag trail in with even is called a crown if can be written in the form ; otherwise, it is called a fence. Furthermore, a fence with odd is called an N-fence, which can be expressed as . Also, a fence with even is called a W-fence if it can be written as while it is called an M-fence if it can be written as . For any fence , its vertices and on both ends are called the endpoints of .
Remark 4.1.
Lemma 4.2.
For any rooted binary phylogenetic network , there exists a unique decomposition of such that each is a maximal zigzag trail in .
Proof.
The proof is divided into two parts. We first claim that holds for any with . Indeed, any zigzag trails and in have a common arc if and only if either or holds because would contain a vertex of indegree or outdegree otherwise. Then, the claim follows from the maximality of and . Next, we will prove that for any , there exists a unique element of with . For any , there exists an obvious zigzag trail in , that is, . Then, because is finite, there exists a maximal one with . By using the first claim, we can conclude that such is uniquely determined by . This completes the proof. ∎
Lemma 4.3.
Let be a rooted binary phylogenetic network and be the maximal zigzag trail decomposition of . Then, is an admissible subset of if and only if is an admissible subset of for each .
Proof.
Our goal is to prove that satisfies the conditions C0, C1, and C2 in Definition 2.3 if and only if for each , the following C0, C1, and C2 hold:
 C0:

contains all with or ;
 C1:

for any with , exactly one of is in ;
 C2:

for any with , at least one of is in .
If satisfies (or ), then there exists a unique element of with and (or ) by Lemma 4.2. The converse also holds as would not be maximal otherwise. Lemma 4.2 also implies that is partitioned into (note that no element of is empty). Thus, we can assert that satisfies C0 if and only if C0 holds for each . By similar reasoning, we can deduce that satisfies if and only if there exists a unique element of such that has the same property. Recalling that is a partition of , we have for any and any with . Hence, satisfies C1 if and only if C1 holds for each . The same arguments derive the desired conclusion regarding C2 and C2. This completes the proof. ∎
From now on, we consider an ordered set of maximal zigzag trails in so that we can identify a subgraph of with a direct product . In addition, for each , we represent the set using a sequence of the elements of that form the zigzag trail in this order, where . This allows us to encode an arbitrary subset of using a 01 sequence of length . For example, given , we can specify the subset by the sequence . Using this notation, for each , we define a family of subsets of as follows.
(1) 
Note that the above sequence representation of the subsets in does not depend on the ordering of the arcs of by virtue of the symmetric structure. For example, when is an Nfence, the sequence and its reverse ordering are identical.
Theorem 4.4 (Structure theorem for treebased phylogenetic networks).
Let be a rooted binary phylogenetic network and be a decomposition of where each is a maximal zigzag trail in . Then, is a tree-based network on if and only if no element is a W-fence. In this case, the collection of subdivision trees of is characterized by
(2) 
where is defined in .
Proof.
We first recall Theorem 3.1. By Lemma 4.2 and Lemma 4.3, one can produce every admissible subset of by choosing each independently. In what follows, we consider a maximal zigzag trail with . We enumerate all 01 sequences corresponding to the admissible subsets of in each of the following four cases (see also Figure 4).

When is a crown , the condition C0 does not apply. Repeated application of the conditions C1 and C2 derives the only solution from . Similarly, implies . This proves that a family of all admissible subsets of is given by .

When is an Nfence , one of its endpoints is a reticulation vertex of . Then, the condition C0 gives , which implies as in the previous case. This proves that is the only admissible subset of .

When is a Wfence , both and are reticulation vertices of , so the constraint C0 gives and again. Similarly to the above, implies while implies according to C1 and C2. However, this means that no admissible subset of exists because is even.

When is an Mfence , we have and again but the other values are left undetermined. We claim that a family of all admissible subsets of with is given by . The proof is by induction on the length (recall that is even). The assertion is trivial for . We consider the two cases according to the value of in the sequence of length . When holds, is the only admissible subset of having this form. When holds, this only implies . By the induction hypothesis, the family of admissible subsets having this form consists of the sequences with , , and . This proves the claim.
This completes the proof. ∎
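The four cases of the proof can be checked by brute force on small trails. The sketch below is only a sanity check under stated assumptions, not part of the proof: a trail with m arcs is modeled abstractly as a 01 sequence in which consecutive arcs alternately share a head (C1, exactly one chosen) or a tail (C2, at least one chosen), and in which C0 forces both end arcs of a fence. The kind labels and the function name `admissible_sequences` are hypothetical.

```python
from itertools import product

def admissible_sequences(kind, m):
    """Brute-force all 0/1 sequences over the m arcs of a maximal zigzag trail.

    kind is one of "crown", "N", "W", "M". Consecutive arcs alternately share
    a head (C1) or a tail (C2); the end arcs of a fence are forced by C0.
    """
    if kind == "N":
        assert m % 2 == 1
    elif kind == "crown":
        assert m % 2 == 0 and m >= 4
    else:
        assert m % 2 == 0 and m >= 2
    # An M-fence starts with two arcs sharing a tail; the other kinds start
    # with two arcs sharing a head.
    share = "tail" if kind == "M" else "head"
    n_pairs = m if kind == "crown" else m - 1
    pairs = []
    for i in range(n_pairs):
        pairs.append((i, (i + 1) % m, share))
        share = "head" if share == "tail" else "tail"
    found = []
    for bits in product((0, 1), repeat=m):
        if kind != "crown" and not (bits[0] == 1 and bits[-1] == 1):
            continue  # C0 on the two end arcs of a fence
        if all((bits[i] + bits[j] == 1) if s == "head" else (bits[i] + bits[j] >= 1)
               for i, j, s in pairs):
            found.append(bits)
    return found

print(admissible_sequences("M", 4))
```

Under this model, a crown admits exactly the two alternating sequences, an N-fence admits one sequence, a W-fence admits none, and an M-fence with m arcs admits m/2 sequences, which agrees with the case analysis above for the small values we tried.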
5. Algorithmic implications and connection to some known results
As we will now explain, Theorem 4.4 furnishes fast algorithms for the problems described in Section 3 and also allows us to prove different known results on treebased phylogenetic networks from a unifying point of view.
Proposition 5.1.
For any rooted binary phylogenetic network , one can obtain the maximal zigzag trail decomposition of in time.
Proof.
As described in Algorithm 1, the above decomposition can be obtained by visiting each arc of exactly once, which requires time. This completes the proof. ∎
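Algorithm 1 is not reproduced in this text, so the following Python sketch is one possible way to compute the decomposition and should not be read as the authors' pseudocode. It assumes arcs are given as hashable (tail, head) tuples of a binary network, so that every arc has at most one other arc with the same head and at most one with the same tail. Each trail is grown from an unvisited arc by alternately following these "siblings", and every arc is visited once. The function name and kind labels are hypothetical.

```python
from collections import defaultdict

def zigzag_decomposition(arcs):
    """Split `arcs` into maximal zigzag trails; return a list of (kind, trail)."""
    by_head, by_tail = defaultdict(list), defaultdict(list)
    for a in arcs:
        by_tail[a[0]].append(a)
        by_head[a[1]].append(a)

    def sibling(a, side):
        # The other arc sharing the head (or tail) of `a`, if any.
        group = by_head[a[1]] if side == "head" else by_tail[a[0]]
        others = [b for b in group if b != a]
        return others[0] if others else None

    def walk(start, side, seen):
        # Follow siblings from `start`, alternating between head and tail.
        path, a = [], start
        while True:
            b = sibling(a, side)
            if b is None or b in seen:
                return path
            seen.add(b)
            path.append(b)
            a, side = b, ("tail" if side == "head" else "head")

    seen, trails = set(), []
    for a in arcs:
        if a in seen:
            continue
        seen.add(a)
        forward = walk(a, "head", seen)
        backward = walk(a, "tail", seen)
        trail = backward[::-1] + [a] + forward
        first = trail[0]
        if sibling(first, "head") and sibling(first, "tail"):
            kind = "crown"      # closed trail: no arc is an end arc
        elif len(trail) % 2 == 1:
            kind = "N"
        elif sibling(first, "tail") is None:
            kind = "W"          # end arcs leave vertices of outdegree 1
        else:
            kind = "M"          # end arcs enter vertices of indegree 1
        trails.append((kind, trail))
    return trails

# Hypothetical toy network with one reticulation vertex d.
ARCS = [("r", "a"), ("a", "b"), ("a", "c"), ("b", "d"),
        ("c", "d"), ("b", "x"), ("c", "y"), ("d", "z")]
for kind, trail in zigzag_decomposition(ARCS):
    print(kind, trail)
```

For the toy network, the sketch finds two single-arc N-fences, one M-fence with two arcs, and one M-fence with four arcs.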
Corollary 5.2.
Let be a rooted binary phylogenetic network that has subdivision trees and be the maximal zigzag trail decomposition of . Then, holds, where
(3) 
Proposition 5.1 and Corollary 5.2 yield the following linear time algorithm for Problem 2. It counts the number of subdivision trees of by first computing the maximal zigzag trail decomposition of and then determining the number of admissible subsets of based on the structure of each . As we will demonstrate in the Appendix, the number has implications for measuring the complexity of .
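The counting step can be sketched in a few lines once the decomposition is known. Since the body of equation (3) is not legible in this text, the per-trail values below are an assumption read off the proof of Theorem 4.4: two for a crown, one for an N-fence, zero for a W-fence, and m/2 for an M-fence with m arcs (the last value is our own count of the sequences produced by the induction). The input format, a list of hypothetical (kind, m) pairs, is also an assumption.

```python
from math import prod

def count_subdivision_trees(trails):
    """Multiply the per-trail counts; `trails` is a list of (kind, m) pairs."""
    def alpha(kind, m):
        if kind == "W":
            return 0            # a W-fence admits no admissible subset
        if kind == "N":
            return 1            # unique alternating sequence
        if kind == "crown":
            return 2            # the two alternating sequences
        if kind == "M":
            return m // 2       # assumed value, see the lead-in
        raise ValueError(f"unknown trail kind: {kind}")
    return prod(alpha(kind, m) for kind, m in trails)

# Trails of a hypothetical toy network with a single reticulation vertex.
print(count_subdivision_trees([("N", 1), ("M", 2), ("M", 4), ("N", 1)]))
```

A result of zero signals that the network is not tree-based, so the same routine also answers the decision part of Problem 1 under these assumptions.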
Although our work is distinguished from previous studies (e.g., [4, 8, 11, 16]) by our structural result, the above counting algorithm can be viewed as a generalization of Francis and Steel’s algorithm [4] for Problem 1, which makes it evident that the problem of determining whether or not is treebased can be solved in time. Besides, Corollary 5.2 proves that is treebased if and only if no maximal zigzag trail in is a Wfence (cf., [8, 16]) and also clarifies what factors on determine the number of subdivision trees of (cf., [8, 11]).
As the next corollary states, the problem of generating all solutions (Problem 3) can be easily solved in the same manner.
Corollary 5.3.
For any rooted binary phylogenetic network , the number of subdivision trees of can be counted in time. Moreover, it is possible to list all subdivision trees of in linear delay.
Proof.
What remains unclear is the second statement. We may assume that holds. Consider the algorithm that first decomposes into maximal zigzag trails and then generates all elements in the solution set . By Theorem 4.4, the elements of can be generated in time if is an M-fence, and in time otherwise. Then, each of the following procedures can be performed in time: finding one of the solutions; checking whether there exists another solution that has not been found yet; and producing the next solution if one exists. This completes the proof. ∎
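The enumeration idea in this proof can be sketched as a Python generator. This is a simplified illustration rather than the linear-delay procedure itself: it assumes each trail is supplied as a hypothetical (kind, arcs) pair with the arcs listed in trail order, it builds the admissible patterns of each trail from the case analysis of Theorem 4.4 (the M-fence patterns are our reading of the induction), and it relies on `itertools.product`, which stores the per-trail options in memory before yielding combinations.

```python
from itertools import product

def trail_options(kind, arcs):
    """Admissible arc subsets of one maximal zigzag trail (arcs in trail order)."""
    m = len(arcs)
    if kind == "W":
        patterns = []
    elif kind == "N":
        patterns = [[1 - i % 2 for i in range(m)]]
    elif kind == "crown":
        patterns = [[1 - i % 2 for i in range(m)], [i % 2 for i in range(m)]]
    elif kind == "M":
        half = (m - 2) // 2
        # 1 (01)^p (10)^q 1 with p + q = (m - 2) / 2
        patterns = [[1] + [0, 1] * p + [1, 0] * (half - p) + [1]
                    for p in range(half + 1)]
    else:
        raise ValueError(f"unknown trail kind: {kind}")
    return [frozenset(a for a, bit in zip(arcs, pat) if bit) for pat in patterns]

def subdivision_tree_arc_sets(trails):
    """Yield the arc set of each subdivision tree, one combination at a time."""
    pools = [trail_options(kind, arcs) for kind, arcs in trails]
    for combo in product(*pools):
        yield frozenset().union(*combo)

# Trails of a hypothetical toy network with a single reticulation vertex d.
TRAILS = [("N", [("r", "a")]),
          ("M", [("a", "c"), ("a", "b")]),
          ("M", [("b", "x"), ("b", "d"), ("c", "d"), ("c", "y")]),
          ("N", [("d", "z")])]
for arc_set in subdivision_tree_arc_sets(TRAILS):
    print(sorted(arc_set))
```

Because the trails are independent, a caller who needs only a designated number of solutions can stop consuming the generator early, for example with `itertools.islice`.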
Likewise, the optimization problem (Problem 4) and its variation (Problem 5) can be solved in linear time, and so we have the following two corollaries. Note that we can turn it into a minimization problem or define an objective function in the product form as in Section 3.
Corollary 5.4.
For any rooted binary phylogenetic network and any associated weighting function , a subdivision tree of to maximize the objective function can be found in time.
Corollary 5.5.
For any rooted binary phylogenetic network , any associated weighting function , and any , it is possible in time to list subdivision trees in nonincreasing order according to the value of the objective function .
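Corollaries 5.4 and 5.5 can be illustrated with a small sketch that exploits the independence of the trails. Several assumptions are made here: the objective is taken to be the sum of the weights of the chosen arcs, trails are given as hypothetical (kind, arcs) pairs in trail order, and the M-fence patterns follow our reading of the proof of Theorem 4.4. The routine keeps only the k best partial solutions after each trail; it is meant to show the idea and does not attain the running time stated in the corollaries.

```python
import heapq

def trail_options(kind, arcs):
    """Admissible arc subsets of one maximal zigzag trail (arcs in trail order)."""
    m = len(arcs)
    if kind == "W":
        patterns = []
    elif kind == "N":
        patterns = [[1 - i % 2 for i in range(m)]]
    elif kind == "crown":
        patterns = [[1 - i % 2 for i in range(m)], [i % 2 for i in range(m)]]
    elif kind == "M":
        half = (m - 2) // 2
        patterns = [[1] + [0, 1] * p + [1, 0] * (half - p) + [1]
                    for p in range(half + 1)]
    else:
        raise ValueError(f"unknown trail kind: {kind}")
    return [frozenset(a for a, bit in zip(arcs, pat) if bit) for pat in patterns]

def k_best_subdivision_trees(trails, weight, k):
    """Return up to k (score, arc_set) pairs in non-increasing order of score."""
    partial = [(0, frozenset())]
    for kind, arcs in trails:
        options = trail_options(kind, arcs)
        candidates = [(score + sum(weight[a] for a in opt), chosen | opt)
                      for score, chosen in partial for opt in options]
        # Only the k best partial sums can extend to the k best totals.
        partial = heapq.nlargest(k, candidates, key=lambda t: t[0])
    return partial

# Hypothetical toy network and weights: arc (b, d) is favored over (c, d).
TRAILS = [("N", [("r", "a")]),
          ("M", [("a", "c"), ("a", "b")]),
          ("M", [("b", "x"), ("b", "d"), ("c", "d"), ("c", "y")]),
          ("N", [("d", "z")])]
WEIGHT = {a: 1 for _, arcs in TRAILS for a in arcs}
WEIGHT[("b", "d")] = 5
WEIGHT[("c", "d")] = 2
for score, arc_set in k_best_subdivision_trees(TRAILS, WEIGHT, 2):
    print(score, sorted(arc_set))
```

Setting `k = 1` gives a maximizer for Problem 4, and negating the weights turns the same routine into a minimizer.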
6. Further research directions
We mention an open problem closely related to Problem 2 and also suggest another possible direction of future work.
6.1. Time complexity of counting base trees
Given a subdivision tree of a rooted binary phylogenetic network , such a rooted binary phylogenetic tree as in Definition 2.2 is called a base tree of [4]. The time complexity of the following problem is still unknown.
Problem 6 ([4]).
Given a treebased phylogenetic network , count the number of base trees of .
The treebased network in Figure 5 strikingly demonstrates the difference between Problem 2 and Problem 6. Given this network as input , our counting algorithm returns although holds as it virtually contains only one phylogenetic tree. In contrast to being computable in time, it might be hard to determine efficiently for all . Even if so, it would be meaningful to consider the relationship between and towards the development of useful criteria for analyzing the complexity of phylogenetic networks.
6.2. Structural properties of nonbinary phylogenetic networks
We note that the results in this paper still hold for any rooted phylogenetic network such that each vertex satisfies both and . This means that we can discuss the treebasedness of nonbinary phylogenetic networks that may have a vertex as in Figure 6. Extensions of the present work might lead to an interesting avenue of research other than the one studied in [8].
References
 [1] M. Anaya, O. Anipchenko-Ulaj, A. Ashfaq, J. Chiu, M. Kaiser, M. S. Ohsawa, M. Owen, E. Pavlechko, K. St. John, S. Suleria, K. Thompson, and C. Yap, On determining if tree-based networks contain fixed trees, Bulletin of Mathematical Biology 78 (2016), no. 5, 961–969.
 [2] D. Bryant and V. Moulton, Neighbor-net: An agglomerative method for the construction of phylogenetic networks, Molecular Biology and Evolution 21 (2004), no. 2, 255–265.
 [3] D. H. Huson and D. Bryant, SplitsTree4 V4.14.6 (20170926), http://www.splitstree.org/.
 [4] A. R. Francis and M. Steel, Which phylogenetic networks are merely trees with additional arcs?, Systematic Biology 64 (2015), no. 5, 768–777.
 [5] L. A. Goldberg, Efficient algorithms for listing combinatorial structures, vol. 5, Cambridge University Press, 2009.
 [6] M. Hayamizu, On the existence of infinitely many universal tree-based networks, Journal of Theoretical Biology 396 (2016), 204–206.
 [7] D. H. Huson, R. Rupp, and C. Scornavacca, Phylogenetic networks: concepts, algorithms and applications, Cambridge University Press, 2010.
 [8] L. Jetten, Characterising tree-based phylogenetic networks (karakterisatie van fylogenetische netwerken die een boom als basis hebben), BSc thesis, Delft University of Technology, November 2015.
 [9] D. S. Johnson, M. Yannakakis, and C. H. Papadimitriou, On generating all maximal independent sets, Information Processing Letters 27 (1988), no. 3, 119–123.
 [10] S. Linz, K. St. John, and C. Semple, Counting trees in a phylogenetic network is #P-complete, SIAM Journal on Computing 42 (2013), no. 4, 1768–1776.
 [11] J. C. Pons, C. Semple, and M. Steel, Tree-based networks: characterisations, metrics, and support trees, Journal of Mathematical Biology (2018).
 [12] B. Schröder, Ordered sets: An introduction with connections from combinatorics to topology 2nd ed., Birkhäuser, 2016.
 [13] M. Steel, Phylogeny: Discrete and random processes in evolution, SIAM, 2016.
 [14] L. G. Valiant, The complexity of enumeration and reliability problems, SIAM Journal on Computing 8 (1979), no. 3, 410–421.
 [15] L. van Iersel, C. Semple, and M. Steel, Locating a tree in a phylogenetic network, Information Processing Letters 110 (2010), no. 23, 1037–1043.
 [16] L. Zhang, On tree-based phylogenetic networks, Journal of Computational Biology 23 (2016), no. 7, 553–565.
Acknowledgment
Appendix: numerical example
We demonstrate how to count the number of subdivision trees of a rooted binary phylogenetic network and how this can be useful to evaluate the complexity of . Given the network in Figure 7, our counting algorithm starts by decomposing into 21 maximal Nfences consisting of a single arc and 7 maximal Mfences and then returns . By comparing this output with the trivial upper bound , where denotes the number of reticulation vertices of , we see that it is meaningful to compute the exact value of . Although the number may seem huge, it is smaller than the number of rooted binary phylogenetic trees that is given by . In other words, does not have adequate complexity in order to cover all rooted binary phylogenetic trees. Thus, the number can be used as a quantitative measure for the complexity of , which may have implications for statistical model selection in evolutionary data analysis.