Phylogenetic trees have long been a powerful tool for representing evolutionary relationships among species. However, since not only speciation but also non-tree-like processes such as hybridization and lateral gene transfer are now acknowledged to be driving forces in the evolution of certain groups of organisms (e.g. bacteria, plants, and fish) mallet16 ; soucy15 , phylogenetic networks are becoming more widely used to represent ancestral histories. A phylogenetic network is a generalization of a rooted phylogenetic tree. More precisely, such a network is a rooted directed acyclic graph whose leaves are labeled huson10 .
The following optimization problem, which is biologically relevant and mathematically challenging, motivates much of the theoretical work on reconstructing phylogenetic networks from phylogenetic trees. Given a collection $\mathcal{P}$ of rooted binary phylogenetic trees on a set of species such that $\mathcal{P}$ correctly represents the tree-like evolution of different parts of the species’ genomes, what is the smallest number of reticulation events that is required to simultaneously embed the trees in $\mathcal{P}$ into a phylogenetic network? Here, reticulation events refer collectively to all non-tree-like events, and they are represented by vertices in a phylogenetic network whose in-degree is at least two. Without any structural constraints on a phylogenetic network, it is well known that $\mathcal{P}$ can always be embedded into such a network baroni05 ; semple07 and, hence, the optimization problem is well-defined. Moreover, despite the problem being NP-hard bordewich07 , even when $|\mathcal{P}|=2$, several exact algorithms have been developed that, given two rooted phylogenetic trees, construct a phylogenetic network whose number of reticulation events is minimized over the space of all networks that embed both trees albrecht12 ; chen13 ; piovesan12 ; wu10 .
Motivated by the introduction of temporal networks baroni06 ; moret04 , which are phylogenetic networks that satisfy several time constraints, Humphries et al. humphries13 ; humphries13a recently investigated the special case of the aforementioned optimization problem for when one is interested in minimizing the number of reticulation events over the smaller space of all temporal networks that embed a given collection of rooted binary phylogenetic trees. More precisely, in the context of their two papers, the authors considered temporal networks to be phylogenetic networks that satisfy the following three constraints:
speciation events occur successively,
reticulation events occur instantaneously, and
each non-leaf vertex has a child whose in-degree is one.
The second constraint implies that the three species that are involved in a reticulation event, i.e. the new species resulting from this event and its two distinct parents, must coexist in time. Moreover, a phylogenetic network that satisfies the third constraint (but not necessarily the first two constraints) is referred to as a tree-child network in the literature cardona12 . Intuitively, if a phylogenetic network $\mathcal{N}$ is temporal, then one can assign a time stamp to each of its vertices such that the following holds for each edge $(u,v)$ in $\mathcal{N}$. If $v$ is a reticulation, then the time stamp assigned to $u$ is the same as the time stamp assigned to $v$. Otherwise, the time stamp assigned to $v$ is strictly greater than that assigned to $u$. Baroni et al. baroni06 showed that it can be checked in polynomial time whether or not a given phylogenetic network satisfies the first two constraints.
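The time-stamp condition can be checked mechanically once a candidate labeling is given. The following is a minimal sketch, assuming the network is supplied as a list of directed edges and the labeling as a dictionary; the function name `is_temporal_labeling` and the toy network are illustrative and not taken from the paper. It verifies only the two edge conditions (equal time stamps across edges entering a reticulation, strictly increasing time stamps across all other edges), not the tree-child constraint.

```python
from collections import Counter


def is_temporal_labeling(edges, time):
    """Check the two edge conditions for a candidate time-stamp map.

    edges: list of directed edges (u, v), pointing away from the root.
    time:  dict mapping every vertex to a number.
    """
    indegree = Counter(v for _, v in edges)
    for u, v in edges:
        if indegree[v] >= 2:
            # v is a reticulation: parent and child must coexist in time.
            if time[u] != time[v]:
                return False
        elif not time[u] < time[v]:
            # Tree edge: the child must be strictly later than the parent.
            return False
    return True


# Toy network (hypothetical): h is a reticulation with parents u and v.
edges = [("r", "u"), ("r", "v"), ("u", "a"), ("u", "h"),
         ("v", "h"), ("v", "c"), ("h", "b")]
good = {"r": 0, "u": 1, "v": 1, "h": 1, "a": 2, "b": 2, "c": 2}
bad = dict(good, v=2)  # v no longer coexists with the reticulation h

print(is_temporal_labeling(edges, good))
print(is_temporal_labeling(edges, bad))
```

Note that this only validates a given labeling; deciding whether some valid labeling exists is the polynomial-time test attributed to Baroni et al. and is not reproduced here.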
Humphries et al. humphries13 have established a new characterization to compute the minimum number of reticulation events that is needed to simultaneously embed an arbitrarily large collection $\mathcal{P}$ of rooted binary phylogenetic trees into a temporal network. This characterization, which is formally defined in Section 2, is in terms of cherries and the existence of a particular type of sequence on the leaves of the trees, called a cherry-picking sequence. It was shown that such a sequence for $\mathcal{P}$ exists if and only if the trees in $\mathcal{P}$ can simultaneously be embedded into a temporal network (humphries13, Theorem 1). Moreover, a cherry-picking sequence for $\mathcal{P}$ can be exploited further to compute the minimum number of reticulation events that is needed over all temporal networks. Importantly, not every collection $\mathcal{P}$ is guaranteed to have a solution, i.e. there may be no cherry-picking sequence for $\mathcal{P}$ and, hence, no temporal network that embeds all trees in $\mathcal{P}$. It was left as an open problem by Humphries et al. humphries13 to analyze the computational complexity of deciding whether or not $\mathcal{P}$ has a cherry-picking sequence when $|\mathcal{P}|=2$.
In this paper, we make progress towards this question and show that it is NP-complete to decide if $\mathcal{P}$ has a cherry-picking sequence when $|\mathcal{P}|\geq 8$. Translated into the language of phylogenetic networks, this result directly implies that it is computationally hard to decide if a collection of at least eight rooted binary phylogenetic trees can simultaneously be embedded into a temporal network. To establish our result, we use a reduction from a variant of the Intermezzo problem guttmann06 . On a more positive note, we show that deciding if $\mathcal{P}$ has a cherry-picking sequence can be done in polynomial time if the number of trees and the number of cherries in each such tree are bounded by a constant. To this end, we explore connections between phylogenetic trees and automata theory and show how the problem at hand can be solved by using a deterministic finite automaton.
The remainder of the paper is organized as follows. The next section contains notation and terminology that is used throughout the paper. Section 3 establishes NP-completeness of a variant of the Intermezzo problem, which is then used in Section 4 to show that it is NP-complete to decide if $\mathcal{P}$ has a cherry-picking sequence when $|\mathcal{P}|\geq 8$. In Section 5, we show that deciding if $\mathcal{P}$ has a cherry-picking sequence is polynomial-time solvable if the number of cherries in each tree and the size of $\mathcal{P}$ are bounded by a constant. We finish the paper with some concluding remarks in Section 6.
This section provides notation and terminology that is used in the subsequent sections. Throughout this paper, $X$ denotes a finite set.
Phylogenetic trees. A rooted binary phylogenetic $X$-tree $\mathcal{T}$ is a rooted tree with leaf set $X$ in which, apart from the root which has degree two, all interior vertices have degree three. Furthermore, a pair of leaves $\{a,b\}$ of $\mathcal{T}$ is called a cherry if $a$ and $b$ are adjacent to a common vertex. Note that every rooted binary phylogenetic tree with at least two leaves has at least one cherry. We denote by $c_{\mathcal{T}}$ the number of cherries in $\mathcal{T}$. We now turn to a rooted binary phylogenetic tree with exactly one cherry. More precisely, we call $\mathcal{T}$ a caterpillar if $|X|=n\geq 2$ and the elements in $X$ can be ordered, say $x_1,x_2,\ldots,x_n$, so that $\{x_1,x_2\}$ is a cherry and, if $p_i$ denotes the parent of $x_i$, then, for all $i\in\{3,4,\ldots,n\}$, we have $(p_i,p_{i-1})$ as an edge in $\mathcal{T}$, in which case we denote the caterpillar by $(x_1,x_2,\ldots,x_n)$. To illustrate, Figure 1 shows the caterpillar with cherry . Two rooted binary phylogenetic $X$-trees $\mathcal{T}$ and $\mathcal{T}'$ are said to be isomorphic if the identity map on $X$ induces a graph isomorphism on the underlying trees.
Subtrees. Now, let $\mathcal{T}$ be a rooted binary phylogenetic $X$-tree, and let $X'$ be a subset of $X$. The minimal rooted subtree of $\mathcal{T}$ that connects all vertices in $X'$ is denoted by $\mathcal{T}(X')$. Furthermore, the rooted binary phylogenetic tree obtained from $\mathcal{T}(X')$ by contracting all non-root degree-2 vertices is the restriction of $\mathcal{T}$ to $X'$ and is denoted by $\mathcal{T}|X'$. We also write or for short to denote . For a set of rooted binary phylogenetic $X$-trees, we write (resp. ) when referring to the set (resp. ). Lastly, a rooted binary phylogenetic tree is pendant in $\mathcal{T}$ if it can be detached from $\mathcal{T}$ by deleting a single edge.
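Cherry-picking sequences. Let $\mathcal{P}$ be a set of rooted binary phylogenetic $X$-trees with $|X|=n$. We say that an ordering of the elements in $X$, say $(x_1,x_2,\ldots,x_n)$, is a cherry-picking sequence for $\mathcal{P}$ precisely if each $x_i$ with $i\in\{1,2,\ldots,n-1\}$ labels a leaf of a cherry in each tree that is obtained from a tree in $\mathcal{P}$ by restricting it to $X\setminus\{x_1,x_2,\ldots,x_{i-1}\}$. Clearly, if $|\mathcal{P}|=1$, then $\mathcal{P}$ has a cherry-picking sequence. However, if $|\mathcal{P}|>1$, then $\mathcal{P}$ may or may not have a cherry-picking sequence.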
We now formally state the decision problem that this paper is centered around.
CPS-Existence
Instance. A collection $\mathcal{P}$ of rooted binary phylogenetic $X$-trees.
Question. Does there exist a cherry-picking sequence for $\mathcal{P}$?
The significance of CPS-Existence lies in its equivalence to the question of whether or not all trees in $\mathcal{P}$ can simultaneously be embedded into a rooted phylogenetic network that satisfies the three temporal constraints described in the introduction.
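Automata and languages. Let $\Sigma$ be an alphabet. A language $L$ is a subset of all possible strings (also called words) whose symbols are in $\Sigma$. More precisely, $L$ is a subset of $\Sigma^*$, where the operator $*$ is the Kleene star. A deterministic finite automaton (or automaton for short) is a tuple $\mathcal{A}=(Q,\Sigma,\delta,q_0,F)$, where
$Q$ is a finite set of states,
$\Sigma$ is a finite alphabet,
$\delta\subseteq Q\times\Sigma\times Q$ is a transition relation,
$q_0\in Q$ is the initial state, and
$F\subseteq Q$ are final states.
A given automaton $\mathcal{A}$ accepts a word $w=a_1a_2\cdots a_n\in\Sigma^*$ if and only if $\mathcal{A}$ is in a final state after having read all symbols from left to right, i.e. there exist states $q_1,q_2,\ldots,q_n\in Q$ such that $(q_{i-1},a_i,q_i)\in\delta$ for each $i\in\{1,2,\ldots,n\}$ and $q_n\in F$.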
The language that is recognized by $\mathcal{A}$ is defined as the set of words that $\mathcal{A}$ accepts. For the automata constructed in this paper, we have and $\delta$ being a total function that maps each pair of a state in $Q$ and a symbol in $\Sigma$ to a state in $Q$. For a detailed introduction to automata theory and languages, see the book by Hopcroft and Ullman hopcroft79 .
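To make the acceptance condition concrete, here is a small sketch of a deterministic finite automaton whose transition relation is a total function, stored as a dictionary. The class name `DFA`, its field names, and the example automaton (words over {a, b} with an even number of a's) are illustrative assumptions and are unrelated to the automata constructed later in the paper.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    states: set
    alphabet: set
    delta: dict      # total function: (state, symbol) -> state
    start: str
    finals: set

    def accepts(self, word):
        """Read the word left to right and test for a final state."""
        state = self.start
        for symbol in word:
            if symbol not in self.alphabet:
                return False
            state = self.delta[(state, symbol)]
        return state in self.finals


# Example automaton: accepts words with an even number of 'a'.
even_a = DFA(
    states={"even", "odd"},
    alphabet={"a", "b"},
    delta={
        ("even", "a"): "odd", ("even", "b"): "even",
        ("odd", "a"): "even", ("odd", "b"): "odd",
    },
    start="even",
    finals={"even"},
)

print(even_a.accepts("abba"))
print(even_a.accepts("ab"))
```

Because `delta` is total, every word over the alphabet leads to exactly one state, so acceptance is decided in time linear in the length of the word.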
3 A variant of the Intermezzo problem
In this section, we establish NP-completeness of a variant of the ordering problem Intermezzo. Let be a finite set, and let be an ordering on the elements in . For two elements and in , we write precisely if precedes in . With this notation in hand, we now formally state Intermezzo which was shown to be NP-complete via reduction from 3-SAT (guttmann06, Lemma 1).
Instance. A finite set , a collection of pairs from , and a collection of pairwise-disjoint triples of distinct elements in .
Question. Does there exist a total linear ordering on the elements in such that for each in , and or for each in ?
Example. Consider the following instance of Intermezzo with three pairs and two disjoint triples (when viewed as sets):
A total linear ordering on the elements in that satisfies all constraints defined by and is
While each element can appear an unbounded number of times in the input of a given Intermezzo instance, this number is bounded from above by in the following Intermezzo variant.
Instance. A finite set , collections of pairs from , and collections of triples of distinct elements in such that, for each , the elements in are pairwise disjoint.
Question. Does there exist a total linear ordering on the elements in such that
Let be an instance of -Disjoint-Intermezzo, and let be an ordering on the elements of that satisfies the two ordering constraints for each pair and triple in the statement of -Disjoint-Intermezzo. We say that is an -Disjoint-Intermezzo ordering for .
We next show that -Disjoint-Intermezzo is NP-complete via reduction from the following restricted version of 3-SAT.
Instance. A set of variables, and a set of clauses, where each clause is a disjunction of exactly three literals, such that each variable appears negated exactly twice and unnegated exactly twice in .
Question. Does there exist a truth assignment for that satisfies each clause in ?
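The syntactic restriction on instances is easy to check. The following sketch assumes a DIMACS-style encoding in which a clause is a tuple of non-zero integers and a negative integer denotes a negated variable; the function name `is_2p2n_3sat_instance` is hypothetical. It tests only the shape of an instance, not satisfiability.

```python
from collections import Counter


def is_2p2n_3sat_instance(clauses):
    """Check the shape restriction: three literals per clause, and each
    variable occurring exactly twice unnegated and exactly twice negated."""
    if any(len(clause) != 3 for clause in clauses):
        return False
    counts = Counter(literal for clause in clauses for literal in clause)
    variables = {abs(literal) for literal in counts}
    return all(counts[v] == 2 and counts[-v] == 2 for v in variables)


# Three variables, four clauses: 12 literal occurrences in total.
instance = [(1, 2, 3), (1, -2, -3), (-1, 2, -3), (-1, -2, 3)]
print(is_2p2n_3sat_instance(instance))
print(is_2p2n_3sat_instance([(1, 2, 3)]))
```

Since every variable contributes exactly four literal occurrences and every clause exactly three, any valid instance with n variables has 4n/3 clauses.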
Berman et al. (berman03, Theorem 1) established NP-completeness for 2P2N-3-SAT.
Theorem 3.1. 4-Disjoint-Intermezzo is NP-complete.
Proof. We show that the construction by Guttmann and Maucher (guttmann06, Lemma 1), which was used to show that Intermezzo is NP-complete via reduction from 3-SAT, yields an instance of 4-Disjoint-Intermezzo if we reduce from 2P2N-3-SAT.
Using the same notation as Guttmann and Maucher (guttmann06, Lemma 1), their construction is as follows. Let be an instance of 2P2N-3-SAT that is given by a set of variables and a set of clauses
where each . Furthermore, for , let denote the number such that . We define the following three sets:
where is an abbreviation of with . By construction, the elements in are pairwise-disjoint triples of distinct elements in and, so, the three sets , , and form an instance of Intermezzo.
Now, we show how the pairs and triples in can be partitioned into sets with , , and such that the elements in are pairwise disjoint. Recalling that is a set of pairwise-disjoint triples, we start by setting and . Furthermore, we set
and . By construction, it is easy to check that the pairs in are pairwise disjoint. Lastly, consider the remaining pairs
and observe that the only possibility for two pairs in to have a non-empty intersection is to have an element with in common. Now, since each is equal to an element in
and each element appears exactly twice negated and twice unnegated in , it follows that there is a partition of into and so that all pairs in the resulting two sets are pairwise disjoint. Setting completes the construction of an instance of 4-Disjoint-Intermezzo. Noting that it is straightforward to compute the partition
in polynomial time and that we did not modify the construction described by Guttmann and Maucher (guttmann06, Lemma 1) itself, it follows from the same proof that has a satisfying truth assignment if and only if has a 4-Disjoint-Intermezzo ordering. ∎
Remark. By the construction of an instance of 4-Disjoint-Intermezzo in the proof of Theorem 3.1, we note that no pair or triple occurs twice and that, for each , we have . We will freely use these facts throughout the remainder of the paper.
4 Hardness of CPS-Existence
In this section, we show that the decision problem CPS-Existence is NP-complete for any collection $\mathcal{P}$ of rooted binary phylogenetic trees on the same leaf set that consists of a constant number of trees with $|\mathcal{P}|\geq 8$. To establish the result, we use a reduction from 4-Disjoint-Intermezzo.
Let be an instance of 4-Disjoint-Intermezzo. Using the same notation as in the definition of -Disjoint-Intermezzo, let
and let . For each , we next construct two rooted binary phylogenetic trees. Let be the subset of that precisely contains each element of that is neither contained in an element of nor contained in an element of
Furthermore, let and both be the caterpillar shown in Figure 1. Setting , let and be the two rooted binary phylogenetic trees obtained from and that result from the following four-step process.
For each in turn, replace the leaf in (resp. ) with the 3-taxon tree on the top left (resp. bottom left) in Figure 2 and increment by one.
For each with in turn, replace the leaf in (resp. ) with the 8-taxon tree on the top right (resp. bottom right) in Figure 2 and increment by one.
For each in turn, replace the leaf in and with the cherry and increment by one.
For each element in , replace the leaf label in and with .
We call the set of intermezzo trees associated with . The next observation is an immediate consequence from the above construction and the fact that, for each , the elements in and are pairwise disjoint.
For an instance of 4-Disjoint-Intermezzo, the set of intermezzo trees associated with consists of eight pairwise non-isomorphic rooted binary phylogenetic trees whose set of leaves is .
We now establish the main result of this section.
Theorem 4.2. Let $\mathcal{P}$ be a collection of rooted binary phylogenetic $X$-trees. CPS-Existence is NP-complete for $|\mathcal{P}|=8$.
Clearly, CPS-Existence for is in NP because, given an ordering on the elements in , we can decide in polynomial time if is a cherry-picking sequence for . Let be an instance of 4-Disjoint-Intermezzo, and let be the set of eight intermezzo trees that are associated with . Note that each tree in can be constructed in polynomial time and has a size that is polynomial in . The remainder of the proof essentially consists of establishing the following claim.
Claim. is a ‘yes’-instance of 4-Disjoint-Intermezzo if and only if has a cherry-picking sequence.
First, suppose that has a cherry-picking sequence. Let be a cherry-picking sequence for , and let be the subsequence of of length that contains each element in . We next show that is a -Disjoint-Intermezzo ordering for . Let be an element of some with , and let , with , be the unique leaf label of and such that is the leaf set of a pendant subtree of and . By construction of and , it is easily seen that exists and in . Hence, in . Turning to the triples, let be an element of some with , and let , with , be the unique leaf label of and such that is the leaf set of a pendant subtree of and . Again, by construction, exists. Let and, similarly, let . It is straightforward to check that each cherry-picking sequence for and satisfies either
Hence, as and are pendant in and , respectively, we have , or in and, consequently, in . Since the above argument holds for each pair and each triple, it follows that is a 4-Disjoint-Intermezzo ordering for and, so, is a ‘yes’-instance.
Conversely, suppose that is a ‘yes’-instance of 4-Disjoint-Intermezzo. Let be a 4-Disjoint-Intermezzo ordering on the elements of . To ease reading, let
Modify as follows to obtain an ordering .
Concatenate with the sequence .
For each in , do one of the following two depending on the order of , , and in . If in , then replace with and replace with . Otherwise, if , replace with and replace with .
Since is a 4-Disjoint-Intermezzo ordering with or for each , it follows from the construction of from that is an ordering on the elements in . It remains to show that is a cherry-picking sequence for . First, consider a pendant subtree with leaf set in and for some . By construction, is a pair in and, so, we have in and in . Second, consider a pendant subtree with leaf set in and for some . By construction, is a triple in and, so, we have either in and
in , or in and
in . Third, consider a pendant subtree with leaf set in and for some . By construction, we have in . Fourth, if , then, as has a 4-Disjoint-Intermezzo ordering, there does not exist a pair in for some . Lastly, observe that is a suffix of and that, for any two trees, say and in , we have that and are isomorphic. Since is a 4-Disjoint-Intermezzo ordering, it is now straightforward to check that is a cherry-picking sequence of . This establishes the proof of the claim and, thereby, the theorem.∎
The next corollary shows that CPS-Existence is not only NP-complete for a collection $\mathcal{P}$ of eight rooted binary phylogenetic trees on the same leaf set, but for any such collection with a fixed number of trees with $|\mathcal{P}|\geq 8$.
Corollary 4.3. Let $\mathcal{P}$ be a collection of rooted binary phylogenetic $X$-trees. CPS-Existence is NP-complete for any fixed $|\mathcal{P}|$ with $|\mathcal{P}|\geq 8$.
Clearly, CPS-Existence for with is in NP. To establish the corollary, we show how one can modify the reduction that is described prior to Theorem 4.2 to obtain a set of rooted binary phylogenetic trees from an instance of 4-Disjoint-Intermezzo.
Let be an instance of 4-Disjoint-Intermezzo. Throughout the remainder of the proof, we assume that there exists an such that . Otherwise, since and is fixed, it follows that has a constant number of pairs and triples with and is solvable in polynomial time.
Now, let and with be a collection of pairs and triples, respectively, such that . Theorem 4.2 establishes the result for when . We may therefore assume that and consider two cases. First, suppose that is even. Replace and in with a partition of into sets. Each of the resulting new sets can be split naturally into a collection of pairs and a collection of triples of which at most one is empty. This results in
collections of pairs and triples, respectively. Now, for each and with , construct two rooted binary phylogenetic trees as described in the definition of the set of intermezzo trees associated with . This yields
pairwise non-isomorphic trees. Second, suppose that
is odd. Replace and in with a partition of into sets. Additionally, add and . Analogous to the first case, this results in
collections of pairs and triples, respectively. Again, for each and with construct two rooted binary phylogenetic trees as described in the definition of the set of intermezzo trees associated with . Noting that the two trees for and are isomorphic, it follows that the construction yields
pairwise non-isomorphic trees. Since the proof of Theorem 4.2 generalizes to a set of intermezzo trees, the corollary now follows for both cases.∎
5 Bounding the number of cherries
The main result of this section is the following theorem.
Theorem 5.1. Let $\mathcal{P}$ be a collection of rooted binary phylogenetic $X$-trees. Let be the maximum element in . Then solving CPS-Existence for $\mathcal{P}$ takes time
where . In particular, the running time is polynomial in if and are constant.
Let be a rooted binary phylogenetic -tree. We denote by the recursively defined set of trees that contains and , and that satisfies the following property.
(P) If a tree is in and is a cherry in , then and are also contained in .
We refer to as the set of cherry-picked trees of . Intuitively, contains each tree that can be obtained from by repeatedly deleting a leaf of a cherry.
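The set of cherry-picked trees can be enumerated as a closure under property (P). The sketch below is an illustration under stated assumptions: trees are encoded as nested `frozenset` objects so that isomorphic trees compare equal, the helper names are hypothetical, and the enumeration stops at single-leaf trees, so any further base element required by the paper's definition is not modeled.

```python
def node(left, right):
    """Build an unordered binary vertex from two subtrees."""
    return frozenset((left, right))


def cherries(tree):
    """Return the cherries of a tree; a cherry is a frozenset of 2 leaves."""
    if isinstance(tree, str):
        return set()
    if all(isinstance(child, str) for child in tree):
        return {tree}
    return set().union(*(cherries(child) for child in tree))


def delete_leaf(tree, leaf):
    """Delete a leaf and suppress the resulting degree-2 vertex."""
    if isinstance(tree, str):
        return None if tree == leaf else tree
    kept = [c for c in (delete_leaf(child, leaf) for child in tree)
            if c is not None]
    return kept[0] if len(kept) == 1 else frozenset(kept)


def cherry_picked_trees(tree):
    """Close {tree} under deleting either leaf of any cherry."""
    seen = {tree}
    stack = [tree]
    while stack:
        current = stack.pop()
        for cherry in cherries(current):
            for leaf in cherry:
                smaller = delete_leaf(current, leaf)
                if smaller not in seen:
                    seen.add(smaller)
                    stack.append(smaller)
    return seen


caterpillar = node(node(node("a", "b"), "c"), "d")
print(len(cherry_picked_trees(caterpillar)))
```

For the four-leaf caterpillar the closure contains the tree itself, two three-leaf trees, three two-leaf trees, and the four single leaves.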
To establish Theorem 5.1, we consider the set of cherry-picked trees of . First, we develop a new vector representation for each tree in and show that the size of is at most . We then construct an automaton whose number of states is and that recognizes whether or not a word that contains each element in precisely once is a cherry-picking sequence for . Lastly, we show how to use a product automaton construction to solve CPS-Existence for a set of rooted binary phylogenetic -trees in time that is polynomial if the number of cherries and the number of trees in is bounded by a constant.
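The idea of running one automaton per tree in lockstep can be illustrated by a plain reachability search in which a state is the tuple of current trees and a transition deletes a leaf that lies in a cherry of every component. This is a simplified sketch, not the vector-based construction developed in this section: it assumes nested 2-tuples as the tree encoding, uses hypothetical helper names, and may visit exponentially many states in general.

```python
def cherry_leaves(tree):
    """Return all leaves that belong to some cherry of the tree."""
    if isinstance(tree, str):
        return set()
    if all(isinstance(child, str) for child in tree):
        return set(tree)
    return set().union(*(cherry_leaves(child) for child in tree))


def delete_leaf(tree, leaf):
    """Delete a leaf and suppress the resulting degree-2 vertex."""
    if isinstance(tree, str):
        return None if tree == leaf else tree
    kept = [c for c in (delete_leaf(child, leaf) for child in tree)
            if c is not None]
    return kept[0] if len(kept) == 1 else tuple(kept)


def has_cherry_picking_sequence(trees):
    """Search the product state space; all trees share one leaf set."""
    start = tuple(trees)
    seen = {start}
    stack = [start]
    while stack:
        state = stack.pop()
        # Accepting state: every tree has been reduced to a single leaf.
        if all(isinstance(tree, str) for tree in state):
            return True
        pickable = set.intersection(*(cherry_leaves(t) for t in state))
        for leaf in pickable:
            successor = tuple(delete_leaf(t, leaf) for t in state)
            if successor not in seen:
                seen.add(successor)
                stack.append(successor)
    return False


yes_instance = [(("a", "b"), ("c", "d")), ((("a", "c"), "b"), "d")]
no_instance = [((("a", "b"), "c"), "d"), ((("c", "d"), "a"), "b")]
print(has_cherry_picking_sequence(yes_instance))
print(has_cherry_picking_sequence(no_instance))
```

In the second instance the two caterpillars have disjoint cherries, so no first element can be picked and the search terminates immediately.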
We start with a simple lemma, which shows that deleting a leaf of a cherry never increases the number of cherries.
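Lemma 5.2. Let $\mathcal{T}$ be a rooted binary phylogenetic $X$-tree, and let $a$ be an element of a cherry in $\mathcal{T}$. Then $\mathcal{T}'=\mathcal{T}|(X\setminus\{a\})$ has at most as many cherries as $\mathcal{T}$.
Proof. Let $b$ be the unique element in $X$ such that $\{a,b\}$ is a cherry in $\mathcal{T}$. Observe that each cherry of $\mathcal{T}$ other than $\{a,b\}$ is also a cherry of $\mathcal{T}'$. Now, let $p$ be the parent of the parent of $a$ in $\mathcal{T}$, and let $v$ be the child of $p$ that is not the parent of $a$. If $v$ is a leaf, then it is easily checked that $\{b,v\}$ is a cherry in $\mathcal{T}'$ and, so, $\mathcal{T}'$ has the same number of cherries as $\mathcal{T}$. On the other hand, if $v$ is not a leaf, then $b$ is not part of a cherry in $\mathcal{T}'$ and, so, $\mathcal{T}'$ has one cherry fewer than $\mathcal{T}$. ∎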
We now define a labeled tree that will play an important role throughout the remainder of this section. Let be a rooted binary phylogenetic -tree with cherries . Obtain a tree from as follows.
Set to be .
Delete all leaves of that are not part of a cherry.
Suppress any resulting degree-2 vertex.
If the root, say , has degree one, delete .
For each cherry with , label the parent of and with , and delete the two leaves and .
Bijectively label the non-leaf vertices of with .
We call the index tree of . By construction, is a labeled rooted binary tree that is unique up to relabeling the internal vertices. To illustrate, an example of the construction of an index tree is shown in Figure 3. The next observation follows immediately from the construction of an index tree.
Let be a rooted binary phylogenetic tree, and let be the index tree associated with . The size of is . In particular, if the number of cherries in is constant, the size of is .
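The construction of an index tree can be sketched recursively: a leaf outside every cherry disappears, a cherry collapses to a labeled leaf, and a vertex with only one surviving child is suppressed. The code below is illustrative and makes several assumptions: trees are nested 2-tuples, cherries are numbered from 1 in left-to-right order (the paper's numbering may differ), and the labels of non-leaf vertices are omitted since they do not affect the shape.

```python
from itertools import count


def _shape(tree, labels):
    """Return the index-tree shape of a subtree, or None if nothing survives."""
    if isinstance(tree, str):
        return None                      # leaf that is not part of a cherry
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        return next(labels)              # cherry becomes a labeled leaf
    kept = [c for c in (_shape(left, labels), _shape(right, labels))
            if c is not None]
    # A vertex with a single surviving child is suppressed.
    return kept[0] if len(kept) == 1 else tuple(kept)


def index_tree(tree):
    """Build the index tree; integers are cherry labels, tuples are vertices."""
    return _shape(tree, count(1))


def size(index):
    """Number of vertices of an index tree."""
    if isinstance(index, int):
        return 1
    return 1 + sum(size(child) for child in index)


tree = ((("a", "b"), "c"), (("d", "e"), ("f", ("g", "h"))))
print(index_tree(tree))
print(size(index_tree(tree)))
```

With three cherries the resulting binary tree has three leaves and two internal vertices, in line with the observation that the size of an index tree is linear in the number of cherries.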
We next define a particular vector relative to a given set. Let be a finite set, let be an element that is not in , and let be a non-negative integer. We call
an -vector if each element in appears at most once in , each is an element in , and each is an element in . Now consider the following two -vectors:
We say that has the suffix-property relative to if, for each , the vector component is equal to or satisfies each of the following equations
Lastly, if has the property that for each , we call the empty vector. Note that the empty vector satisfies the suffix-property relative to every -vector.
Building on the definition of an -vector, we now describe a vector representation of a rooted binary phylogenetic tree that can be constructed by using its index tree as a guide. Roughly, the representation associates a caterpillar-type structure to each vertex in the index tree. Let be a rooted binary phylogenetic -tree, let , and let . For two vertices and in , we say that (resp. ) is an ancestor (resp. descendant) of (resp. ) if there is a directed path from to in . Throughout this section, we regard a vertex of to be an ancestor and a descendant of itself. The most recent common ancestor of is the vertex in whose set of descendants contains and no descendant of , except itself, has this property. We denote by . Now, let be the set of all cherries in . First, for each leaf in , let be the maximal pendant caterpillar in with cherry . We denote this by
where and . Second, for each non-leaf vertex labeled in with , let be the vertex in such that