Deciding the existence of a cherry-picking sequence is hard on two trees

Here we show that deciding whether two rooted binary phylogenetic trees on the same set of taxa permit a cherry-picking sequence, a special type of elimination order on the taxa, is NP-complete. This improves on an earlier result which proved hardness for eight or more trees. Via a known equivalence between cherry-picking sequences and temporal phylogenetic networks, our result proves that it is NP-complete to determine the existence of a temporal phylogenetic network that contains topological embeddings of both trees. The hardness result also greatly strengthens previous inapproximability results for the minimum temporal-hybridization number problem. This is the optimization version of the problem where we wish to construct a temporal phylogenetic network that topologically embeds two given rooted binary phylogenetic trees and that has a minimum number of indegree-2 nodes, which represent events such as hybridization and horizontal gene transfer. We end on a positive note, pointing out that fixed parameter tractability results in this area are likely to ensure the continued relevance of the temporal phylogenetic network model.


1 Introduction

In the field of phylogenetics it is common to represent the evolution of a set of species by a rooted phylogenetic tree: essentially a rooted, bifurcating tree whose leaves are bijectively labeled by the species [19]. Driven by the realization that evolution is not always treelike, there has been growing interest in the construction of phylogenetic networks, which generalize phylogenetic trees to directed acyclic graphs [1, 11, 14, 20]. One well-known optimization problem for phylogenetic networks is as follows: given a set of rooted phylogenetic trees on the same set of taxa, compute a phylogenetic network which displays (i.e. contains topological embeddings of) all the input trees, such that the reticulation number of the network is minimized. When the network is restricted to being binary, this is equivalent to minimizing the number of its nodes with indegree-2. This optimization model is known as minimum hybridization and it has been extensively studied in the last decade (see e.g. [2, 6, 16, 22, 24]). More recently, variations of minimum hybridization have been proposed which constrain the topology of the network to be more biologically relevant. One such constraint is to demand that the network is temporal [17]. Informally, a phylogenetic network is temporal if (i) its nodes can be labeled with times, such that nodes of indegree-2 have contemporaneous parents, and time moves strictly forwards along treelike parts of the network; and (ii) each non-leaf vertex has a child whose indegree is 1. Property (ii) by itself is referred to as tree-child in the literature [5]. It has been shown that, when the input consists of two trees, it is NP-hard to solve the minimum temporal-hybridization number problem to optimality [13]. To establish the result, the authors proved that the problem is in fact APX-hard, which implies that for some constant $c > 1$ it is not possible in polynomial time to approximate the optimum within a factor of $c$, unless P=NP [18].

A more fundamental question, however, remained open: is it possible in polynomial time to determine if any temporal phylogenetic network exists that displays the input trees, regardless of how large its reticulation number is [12, 21]? Here we settle this question by showing that, even for two trees, it is NP-complete to determine whether such a network exists. We prove this by using the cherry-picking characterization of temporal phylogenetic networks introduced in [12]. There it was shown that, given an arbitrarily large set of rooted binary phylogenetic trees on the same set of taxa, there exists a temporal phylogenetic network that displays each tree in the set precisely if the set has a so-called cherry-picking sequence. Informally, a cherry-picking sequence is an elimination order on the taxa that deletes one taxon at a time, where at each step only taxa that are in a cherry of every tree can be deleted [12]. We show here that the seminal NP-hard problem 3-SAT [15] can be reduced to the question of whether two trees permit a cherry-picking sequence. This improves upon a recent result by two of the present authors which shows that, for eight or more trees, it is NP-complete to determine whether the set of trees has a cherry-picking sequence [8]. Our hardness result is highly non-trivial and requires extensive gadgetry; to clarify, we include an explicit example of the construction after the main proof.

As we discuss in the final section of the paper, this result has quite significant negative consequences: given that the decision problem is already hard, the minimum temporal-hybridization number problem is in some sense “effectively inapproximable”, even for two trees. This greatly strengthens the earlier APX-hard inapproximability result. Nevertheless, as we subsequently point out, positive fixed parameter tractability (FPT) [7] results for the minimum temporal-hybridization number problem do already exist [12] and our results emphasize the importance of further developing such algorithms, since fixed parameter tractability forms the most promising remaining avenue towards practical exact methods.

2 Preliminaries

A rooted binary phylogenetic tree on a set of taxa $X$, where $|X| \geq 2$, is a rooted, connected, directed tree with a unique root (a vertex of indegree-0 and outdegree-2), where the leaves (vertices with indegree-1 and outdegree-0) are bijectively labeled by $X$, and where all interior vertices of the tree are indegree-1 and outdegree-2. If $|X| = 1$, we consider the single isolated node labeled by the unique element of $X$ to also be a rooted binary phylogenetic tree. Since all phylogenetic trees considered in this paper are rooted and binary, we henceforth write tree for brevity, and draw no distinction between the elements of $X$ and the leaves they label. Let $T$ be a tree, and let $\mathcal{T}$ be a set of trees. We use $X(T)$ to denote the taxa set of $T$ and, similarly, we use $X(\mathcal{T})$ to denote the union of taxa sets over all elements in $\mathcal{T}$, i.e. $X(\mathcal{T}) = \bigcup_{T \in \mathcal{T}} X(T)$. Lastly, for two distinct elements $x$ and $y$ in $X(T)$, we call $\{x, y\}$ a cherry of $T$ if they have the same parent. A tree with a single cherry is referred to as a caterpillar.

Now, let be a tree on , and let be an arbitrary set. We write to denote the tree obtained from by taking the minimum subtree spanning the elements of and repeatedly suppressing all vertices with indegree-1 and outdegree-1. Furthermore, we also write or for short to denote . If , then is the null tree and is itself. For a set of trees on subsets of , we write (resp. ) when referring to the set (resp. ). Lastly, a rooted binary phylogenetic tree is pendant in if it can be detached from by deleting a single edge.

2.1 Cherry-picking sequence problem on trees with the same set of taxa

Figure 1: A cherry-picking sequence for the two trees and at the top is . The two trees in the middle have been obtained from and , respectively, by pruning , and the two trees at the bottom have been obtained from and by first pruning and, subsequently, pruning . While we can alternatively prune and, subsequently, , from and , note that no cherry-picking sequence exists for and whose first two elements are and .

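We say that a taxon $x$ is in a cherry of a tree $T$ if there exists some taxon $y \neq x$ such that $\{x, y\}$ is a cherry of $T$, or $T$ consists of a single leaf $x$. If $x$ is in a cherry of $T$, we say that $x$ is picked (or pruned) from $T$ to denote the operation of replacing $T$ with the tree obtained by deleting $x$ from $T$. Given a set of trees $\mathcal{T}$, all on the same set of taxa $X$, we say that a taxon $x \in X$ is available (for picking) in $\mathcal{T}$ if $x$ is in a cherry in each tree in $\mathcal{T}$. When this is the case, we say that $x$ is picked (or pruned) from $\mathcal{T}$ to denote the operation of replacing each tree in $\mathcal{T}$ with the tree obtained by deleting $x$ from it.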

Let $\mathcal{T}$ be a set of trees on the same set of taxa $X$. A cherry-picking sequence is an order on $X$, say $(x_1, x_2, \ldots, x_{|X|})$, such that each $x_i$ with $i \in \{1, 2, \ldots, |X|\}$ is available in the set of trees obtained from $\mathcal{T}$ by pruning $x_1, x_2, \ldots, x_{i-1}$. Such a sequence is not guaranteed to exist; if it does, we say that $\mathcal{T}$ permits a cherry-picking sequence. It was shown in [8] that deciding whether such a sequence exists is NP-complete if $|\mathcal{T}| \geq 8$. Note that, if $|\mathcal{T}| = 1$, then $\mathcal{T}$ always has a cherry-picking sequence. To illustrate, Figure 1 shows a cherry-picking sequence for the two trees at its top.
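The definition above can be checked mechanically, which is also why the decision problem lies in NP. The following is a minimal sketch, not taken from the paper: it assumes trees are encoded as nested pairs (a leaf is its label, an internal vertex is a pair of subtrees), and the function names are hypothetical.

```python
from typing import List, Optional, Sequence, Union

# Assumed encoding: a leaf is its taxon label (str); an internal vertex is a
# pair (left_subtree, right_subtree); None stands for the null tree.
Tree = Union[str, tuple]


def leaves(tree: Optional[Tree]) -> set:
    """Return the taxa labelling the leaves of `tree`."""
    if tree is None:
        return set()
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def has_cherry_with(tree: Tree, x: str) -> bool:
    """True if some cherry {x, y} occurs in `tree`."""
    if isinstance(tree, str):
        return False
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        return x in (left, right)
    return has_cherry_with(left, x) or has_cherry_with(right, x)


def in_cherry(tree: Tree, x: str) -> bool:
    """x is in a cherry of `tree`, or `tree` is the single leaf x."""
    if isinstance(tree, str):
        return tree == x
    return has_cherry_with(tree, x)


def prune(tree: Optional[Tree], x: str) -> Optional[Tree]:
    """Delete leaf x and suppress the vertex left with a single child."""
    if tree is None:
        return None
    if isinstance(tree, str):
        return None if tree == x else tree
    left, right = prune(tree[0], x), prune(tree[1], x)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)


def is_cherry_picking_sequence(trees: List[Tree], order: Sequence[str]) -> bool:
    """Check whether `order` is a cherry-picking sequence for `trees`."""
    taxa = leaves(trees[0])
    if any(leaves(t) != taxa for t in trees):
        raise ValueError("all trees must have the same taxa set")
    if sorted(order) != sorted(taxa):
        return False  # not an order on the full taxa set
    current = list(trees)
    for x in order:
        if not all(in_cherry(t, x) for t in current):
            return False  # x is not available at this step
        current = [prune(t, x) for t in current]
    return True


# Illustrative trees (not the ones from Figure 1).
T1 = ((("a", "b"), "c"), "d")
T2 = ((("a", "c"), "b"), "d")
print(is_cherry_picking_sequence([T1, T2], ["a", "b", "c", "d"]))
print(is_cherry_picking_sequence([T1, T2], ["b", "a", "c", "d"]))
```

In this illustrative pair, only `a` is available at the start, so any order beginning with `b` is rejected. The check runs in time polynomial in the number of taxa and trees.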

2.2 A more general cherry-picking sequence problem

Let $\mathcal{T}$ be a set of trees, and let $X$ be the union of the taxa sets of the trees in $\mathcal{T}$. Suppose we consider the variant of the problem described in Section 2.1 in which the trees in $\mathcal{T}$ do not necessarily have the same set of taxa. In this case, some taxa may be missing from some trees. This requires us to generalize the concept of being in a cherry of a tree. We say that a taxon $x \in X$ is in a cherry of a tree $T$ if exactly one of the following conditions holds:

  1. $x$ is not a taxon of $T$, or

  2. $x$ is a taxon of $T$ and $x$ is in a cherry of $T$ in the sense of Section 2.1.

(Note that, once again, this means that if $x$ is the only taxon in $T$, then $x$ is vacuously considered to be in a cherry of $T$.) It initially seems counter-intuitive to say, when condition 1 applies, that $x$ is “in” a cherry of $T$. However, the idea behind this is that such trees do not constrain whether $x$ can be picked; they “do not care”. More formally, we say that a taxon $x \in X$ is available in $\mathcal{T}$ if it is in a cherry in each tree in $\mathcal{T}$. Similar to Section 2.1, we say that an order on $X$, say $(x_1, x_2, \ldots, x_{|X|})$, is a cherry-picking sequence of $\mathcal{T}$ if each $x_i$ with $i \in \{1, 2, \ldots, |X|\}$ is available in the set of trees obtained from $\mathcal{T}$ by pruning $x_1, x_2, \ldots, x_{i-1}$. If a tree becomes the null tree due to all its taxa being pruned away, then this tree plays no further role. Moreover, we note that, if all trees in $\mathcal{T}$ have the same set of taxa, then the more general definition of a cherry-picking sequence given in this subsection, which will be used throughout the rest of this paper, coincides with that given in Section 2.1.
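To make the generalized notion concrete, the sketch below implements the "do not care" rule and an exhaustive search for a cherry-picking sequence. It is an illustration only: the nested-pair tree encoding and all function names are assumptions, and the search takes exponential time in the worst case, in line with the hardness result of this paper.

```python
from typing import List, Optional, Union

# Assumed encoding: leaf = label (str), internal vertex = (left, right),
# None = null tree.
Tree = Union[str, tuple]


def leaves(tree: Optional[Tree]) -> set:
    if tree is None:
        return set()
    if isinstance(tree, str):
        return {tree}
    return leaves(tree[0]) | leaves(tree[1])


def has_cherry_with(tree: Tree, x: str) -> bool:
    if isinstance(tree, str):
        return False
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        return x in (left, right)
    return has_cherry_with(left, x) or has_cherry_with(right, x)


def in_cherry_general(tree: Optional[Tree], x: str) -> bool:
    """Generalized test: a tree that does not contain x does not care."""
    if x not in leaves(tree):
        return True  # condition 1 (also covers the null tree)
    if isinstance(tree, str):
        return True  # x is the only taxon of the tree
    return has_cherry_with(tree, x)  # condition 2


def prune(tree: Optional[Tree], x: str) -> Optional[Tree]:
    if tree is None:
        return None
    if isinstance(tree, str):
        return None if tree == x else tree
    left, right = prune(tree[0], x), prune(tree[1], x)
    if left is None:
        return right
    if right is None:
        return left
    return (left, right)


def find_cherry_picking_sequence(trees: List[Tree]) -> Optional[List[str]]:
    """Depth-first search over all orders; returns a sequence or None."""
    taxa = set().union(*(leaves(t) for t in trees))

    def search(current, remaining, prefix):
        if not remaining:
            return prefix
        for x in sorted(remaining):
            if all(in_cherry_general(t, x) for t in current):
                found = search([prune(t, x) for t in current],
                               remaining - {x}, prefix + [x])
                if found is not None:
                    return found
        return None

    return search(list(trees), taxa, [])


# b is missing from the second tree, which therefore never blocks b.
print(find_cherry_picking_sequence([(("a", "b"), "c"), ("a", "c")]))
# Three pairwise conflicting triples: no taxon is ever available.
print(find_cherry_picking_sequence(
    [(("a", "b"), "c"), (("a", "c"), "b"), (("b", "c"), "a")]))
```

The second call illustrates that a sequence need not exist: each of the three taxa is blocked by exactly one of the trees.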

3 Main results

In this section, we establish the main result of this paper. We start with two lemmas.

Lemma 1

Let be a set of trees on not necessarily the same set of taxa. Then we can construct in polynomial time a set of trees all on the same set of taxa, such that has a cherry-picking sequence if and only if does.

Proof

Let , and let be the set of taxa that are missing from at least one input tree. Let be a disjoint copy of this set. Every modified tree will have taxon set . The idea is as follows. Let be an arbitrary rooted binary tree on . For each input tree , we start by joining and beneath a root, and then join this new tree and together beneath a root. Next, for each that is missing from , we add by subdividing the edge that feeds into and attaching there (so and become siblings). For an example, see Figure 2. We call the set of trees constructed in this way . The high-level idea is that if a tree does not contain some taxon , we attach just above and thus ensure that, trivially, is in a cherry in that tree (i.e. together with ). So does “not care” about and will not obstruct it from being pruned.

Figure 2: The construction described in Lemma 1. Here , the set of taxa missing from at least one tree, is . In each modified tree the artificially added members of are circled; note that they are always in cherries. A cherry-picking sequence for the original trees is . A corresponding sequence for the modified trees is .

First, assume that has a cherry-picking sequence . (We show that has a cherry-picking sequence). We start by applying exactly the same sequence of pruning operations to . These picking operations will always be possible because, if a taxon is missing from a tree , it will be in a cherry together with in the corresponding tree of . After doing this, all the trees will be isomorphic and have the same set of taxa: . At this point these remaining taxa can be pruned in bottom-up fashion (since two isomorphic trees always have a cherry-picking sequence). Hence has a cherry-picking sequence. Note that the taxon is included to ensure that if, during , a tree has been pruned down to a single taxon, this taxon can still be pruned in the corresponding tree of (because it is sibling to ).

In the other direction, let be a cherry-picking sequence for . Let be the sequence obtained by deleting all taxa from that are not in . Let be an arbitrary element of and let be the position of in . Let be the prefix of that has been pruned from prior to , and let (where ) be the prefix of that has been pruned prior to . We claim that, if is available in , then it is also available in . To see this, let be an arbitrary tree in . If , then (by definition) is in a cherry of . If is the only taxon in , then it is (also by definition) in a cherry. So the only case remaining is that and . Let be the tree from that corresponds to . The critical observation here is that, by construction, occurs as a pendant subtree of . So if was not in a cherry of , it would not be in a cherry of (which violates the assumption). Hence, is in a cherry of . Due to the arbitrary choice of and , it follows that is a cherry-picking sequence for .

It remains to show that the reduction is polynomial time. Observe that, depending on the instance, the size of can be dominated by or . Each of the trees in contains taxa, where , and the transformation itself involves straightforward operations, so overall the reduction takes poly(, ) time.
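The construction of Lemma 1 can be sketched in code as follows. This is a reading of the proof under stated assumptions rather than the authors' implementation: the extra taxon is called `rho`, the disjoint copy of a taxon `x` is labeled `x*`, the arbitrary tree on the copies is taken to be a caterpillar, and trees are nested pairs. A small exhaustive search is included only to compare the original and modified instances on toy inputs.

```python
from typing import List, Optional, Union

Tree = Union[str, tuple]  # leaf label, or (left, right); None = null tree


def leaves(tree: Optional[Tree]) -> set:
    if tree is None:
        return set()
    return {tree} if isinstance(tree, str) else leaves(tree[0]) | leaves(tree[1])


def has_cherry_with(tree: Tree, x: str) -> bool:
    if isinstance(tree, str):
        return False
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        return x in (left, right)
    return has_cherry_with(left, x) or has_cherry_with(right, x)


def prune(tree: Optional[Tree], x: str) -> Optional[Tree]:
    if tree is None or isinstance(tree, str):
        return None if tree == x else tree
    left, right = prune(tree[0], x), prune(tree[1], x)
    if left is None or right is None:
        return right if left is None else left
    return (left, right)


def available(trees: List[Optional[Tree]], x: str) -> bool:
    """Generalized availability: trees without x never block x."""
    return all(x not in leaves(t) or isinstance(t, str) or has_cherry_with(t, x)
               for t in trees)


def has_sequence(trees: List[Optional[Tree]]) -> bool:
    """Exhaustive search; exponential time, intended for toy inputs only."""
    remaining = set().union(*(leaves(t) for t in trees))
    if not remaining:
        return True
    return any(available(trees, x) and has_sequence([prune(t, x) for t in trees])
               for x in sorted(remaining))


def attach_sibling(tree: Tree, target: str, new_leaf: str) -> Tree:
    """Subdivide the edge entering leaf `target` and hang `new_leaf` there."""
    if isinstance(tree, str):
        return (new_leaf, tree) if tree == target else tree
    return (attach_sibling(tree[0], target, new_leaf),
            attach_sibling(tree[1], target, new_leaf))


def equalise_taxa(trees: List[Tree], rho: str = "rho") -> List[Tree]:
    """Return trees on a common taxa set, following the idea of Lemma 1."""
    all_taxa = set().union(*(leaves(t) for t in trees))
    missing = sorted(x for x in all_taxa if any(x not in leaves(t) for t in trees))
    if not missing:
        return list(trees)
    copy_tree: Tree = missing[0] + "*"  # hypothetical label for the copy of x
    for x in missing[1:]:
        copy_tree = (copy_tree, x + "*")  # caterpillar on the copies
    result = []
    for tree in trees:
        side = copy_tree
        for x in missing:
            if x not in leaves(tree):
                side = attach_sibling(side, x + "*", x)  # x joins its copy
        result.append(((tree, rho), side))
    return result


yes_instance = [(("a", "b"), "c"), ("a", "c")]
no_instance = [(("a", "b"), "c"), (("a", "c"), "b"), (("b", "c"), "a"), ("a", "d")]
for instance in (yes_instance, no_instance):
    modified = equalise_taxa(instance)
    print(has_sequence(instance), has_sequence(modified))
```

On both toy instances the two answers agree, as the lemma predicts, and every modified tree has the same taxa set.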

Let be a set of rooted binary trees, and let and be two trees in such that . Furthermore, let and be the root vertex of and , respectively. Obtain a new tree from and in the following way.

  1. Create a new vertex and add new edges and .

  2. Subdivide (resp. ) with a new vertex (resp. ) and add a new edge (resp. ), where and are two new taxa such that .

We call the resulting rooted binary tree the compound tree of and . To illustrate, Figure 3 depicts the compound tree of and .

Figure 3: The compound tree of two rooted binary trees and . The taxon (respectively, ) simply ensures that the last taxon pruned away in the (respectively, ) part is in a cherry.

The next lemma shows that, for a set of rooted binary trees, the replacement of two trees in with their compound tree preserves the existence and non-existence of a cherry-picking sequence for .

Lemma 2

Let be a set of rooted binary trees, and let and be two trees in such that . Let be the compound tree of and . Then has a cherry-picking sequence if and only if has a cherry-picking sequence.

Proof

To ease reading, let . Furthermore, let , and let and be the unique two taxa in that do not label a leaf in or .

Suppose that is a cherry-picking sequence for . Let be the maximum index of an element in such that and, similarly, let be the maximum index of an element in such that . Then contains a tree that is a single vertex labeled and contains a tree that is a single vertex labeled . Moreover, by the construction of , the set contains a tree with cherry and the set contains a tree with cherry . Since and are pendant subtrees in and is a cherry-picking sequence for , it now follows that

is a cherry-picking sequence for .

Conversely, suppose that is a cherry-picking sequence for . Let . Without loss of generality, we may assume that . Then, as and are only contained in the leaf set of , it is straightforward to check that

is a cherry-picking sequence for . ∎
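As a small illustration of the compound tree and of Lemma 2, the sketch below builds the compound of two trees and compares, by exhaustive search, whether a cherry-picking sequence exists before and after the replacement. It assumes that the two trees have disjoint taxa sets, that trees are nested pairs, and that the two new taxa are labeled `t` and `t'`; these names and the helper functions are hypothetical.

```python
from typing import List, Optional, Union

Tree = Union[str, tuple]  # leaf label, or (left, right); None = null tree


def leaves(tree: Optional[Tree]) -> set:
    if tree is None:
        return set()
    return {tree} if isinstance(tree, str) else leaves(tree[0]) | leaves(tree[1])


def has_cherry_with(tree: Tree, x: str) -> bool:
    if isinstance(tree, str):
        return False
    left, right = tree
    if isinstance(left, str) and isinstance(right, str):
        return x in (left, right)
    return has_cherry_with(left, x) or has_cherry_with(right, x)


def prune(tree: Optional[Tree], x: str) -> Optional[Tree]:
    if tree is None or isinstance(tree, str):
        return None if tree == x else tree
    left, right = prune(tree[0], x), prune(tree[1], x)
    if left is None or right is None:
        return right if left is None else left
    return (left, right)


def available(trees: List[Optional[Tree]], x: str) -> bool:
    """Generalized availability: trees without x never block x."""
    return all(x not in leaves(t) or isinstance(t, str) or has_cherry_with(t, x)
               for t in trees)


def has_sequence(trees: List[Optional[Tree]]) -> bool:
    """Exhaustive search; exponential time, intended for toy inputs only."""
    remaining = set().union(*(leaves(t) for t in trees))
    if not remaining:
        return True
    return any(available(trees, x) and has_sequence([prune(t, x) for t in trees])
               for x in sorted(remaining))


def compound_tree(first: Tree, second: Tree,
                  extra_first: str = "t", extra_second: str = "t'") -> Tree:
    """Join both trees under a new root, each with a new sibling taxon."""
    used = leaves(first) | leaves(second)
    if leaves(first) & leaves(second):
        raise ValueError("the two trees must have disjoint taxa sets")
    if extra_first == extra_second or {extra_first, extra_second} & used:
        raise ValueError("the two extra taxa must be new and distinct")
    return ((first, extra_first), (second, extra_second))


A, B, C = ("a", "b"), ("c", "d"), (("a", "c"), ("b", "d"))
print(has_sequence([A, B, C]), has_sequence([compound_tree(A, B), C]))

T1, T2, T3, D = (("a", "b"), "c"), (("a", "c"), "b"), (("b", "c"), "a"), ("d", "e")
print(has_sequence([T1, T2, T3, D]), has_sequence([compound_tree(T1, D), T2, T3]))
```

In both toy cases the answer is unchanged by the replacement. Applying the operation repeatedly reduces the number of trees by one each time, which is how the proof of the main theorem folds many gadget trees into two.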

Now we establish the main result of this paper.

Theorem 3.1

It is NP-complete to decide if two rooted binary phylogenetic trees and on have a cherry-picking sequence.

Proof

Given an order on , we can decide in polynomial time if, for each , is in a cherry in and . Hence, the problem of deciding if and have a cherry-picking sequence is in NP. To establish the theorem, we use a reduction from 3-Sat. This is the variant of Satisfiability where each clause contains exactly three literals, and the logical expression is in conjunctive normal form, i.e.,

where . The corresponding set of variables is denoted with

We reduce from the NP-complete version of 3-Sat in which no variable occurs more than once in a given clause. Such restricted instances can easily be obtained by a standard transformation as described in [10]. In the remainder of this proof, and refer to the number of variables and clauses in a restricted 3-Sat instance, respectively.

Now, given an instance of 3-Sat, we first construct a set of trees with overlapping taxa sets and show that has a satisfying truth assignment if and only if has a cherry-picking sequence. We then repeatedly apply Lemma 2 in order to replace with two trees and, finally, apply Lemma 1 to complete the proof of this theorem.

We start by describing the construction of that makes use of the introduction of a set of blocking taxa. As we will see later, each such taxon can only be pruned from after certain other taxa have been pruned first and so the main function of the blocking taxa is to be unavailable for pruning, which in turn constrains the number of possibilities to construct a cherry-picking sequence from . An explicit example of the construction of is given after this proof.

Figure 4: Each variable , is represented by a single tree in and two trees in .

Variable gadget. We construct two sets and of trees. Each variable with adds one tree on four taxa to which is the tree shown in the solid box of Figure 4. Each such tree has two blocking taxa and, intuitively, encodes whether is set to be true or false, depending on whether or is pruned first. Moreover, each variable adds two caterpillars to . Relative to a fixed , the precise construction of these caterpillars is based on the definition of two particular tuples. Let (resp. ) be the tuple of all indices in ascending order that are elements in and that index clauses in which appears unnegated (resp. negated). Since no clause contains any variable more than once, the elements in (resp. ) are pairwise distinct.

Now the taxon set of one caterpillar contains , a new blocking taxon and, for each element in , a new taxon , while the taxon set of the other caterpillar contains , a new blocking taxon and, for each element in , a new taxon . The precise ordering of the leaves in both caterpillars is shown in the dashed box of Figure 4. It is easily checked that and, since each clause contains precisely three distinct variables, . Noting that the taxa sets of the trees in and only overlap in and , we have

(1)

distinct taxa over all trees in and .
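The two index tuples used in the variable gadget (clauses in which a variable appears unnegated, and clauses in which it appears negated) are easy to compute from a formula. The sketch below assumes a DIMACS-style encoding that is not part of the paper: variables are numbered from 1, a positive integer denotes an unnegated literal and a negative integer a negated one. The function name is hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def occurrence_tuples(
    num_vars: int, clauses: Sequence[Sequence[int]]
) -> Dict[int, Tuple[Tuple[int, ...], Tuple[int, ...]]]:
    """Map each variable i to (unnegated clause indices, negated clause indices).

    Clause indices start at 1 and are listed in ascending order.
    """
    occurrences: Dict[int, Tuple[List[int], List[int]]] = {
        i: ([], []) for i in range(1, num_vars + 1)
    }
    for j, clause in enumerate(clauses, start=1):
        variables = [abs(literal) for literal in clause]
        # Restricted 3-Sat: exactly three literals over three distinct variables.
        if len(clause) != 3 or len(set(variables)) != 3:
            raise ValueError(f"clause {j} must contain three distinct variables")
        if any(v < 1 or v > num_vars for v in variables):
            raise ValueError(f"clause {j} uses an unknown variable")
        for literal in clause:
            unnegated, negated = occurrences[abs(literal)]
            (unnegated if literal > 0 else negated).append(j)
    return {i: (tuple(p), tuple(n)) for i, (p, n) in occurrences.items()}


# (v1 or not v2 or v3) and (not v1 or v2 or v4)
clauses = [(1, -2, 3), (-1, 2, 4)]
tuples = occurrence_tuples(4, clauses)
print(tuples)

# Each clause contributes exactly three entries in total.
total = sum(len(p) + len(n) for p, n in tuples.values())
print(total == 3 * len(clauses))
```

Because no variable occurs twice in a clause, the entries of each tuple are pairwise distinct, and the total length of all tuples is three times the number of clauses.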

Figure 5: Each clause is represented by three trees in and two trees in .

Clause gadget. We construct two sets and of trees. For each , consider the clause , where each is an element in with . Relative to , we add three three-taxon trees to which are shown in the solid box of Figure 5. The first such tree has taxon set where is an element in . Note that labels a leaf of a tree in while the other two taxa do not label a leaf of a tree in or . The other two trees in are constructed in an analogous way. Furthermore, for each , we add two five-taxon trees to which are shown in the dashed box of Figure 5. The taxa set of the first tree contains two new blocking taxa and the three previously encountered elements , while the second tree contains one new blocking taxon, the new taxon , and the three previously encountered elements . Similar to the variable gadgets, we now count the number of taxa in trees in and . As no two trees in or share a taxon, we have and . Moreover, since all taxa of trees in , except for the blocking taxa and elements in , are also taxa of trees in , we have

(2)

Formula gadget. We complete the construction of by constructing two caterpillars and which are shown in the solid and dashed box of Figure 6, and define

Summarizing the construction, we have . Moreover, by construction and Equations (1)-(2), it follows that . Now, since the three taxa , , and , which are common to and , are the only taxa of these two trees that are not contained in the taxa set of any other constructed tree, we have

(3)
Figure 6: The two trees and in the construction of from .

We next prove the following claim:

Claim 1. is satisfiable if and only if has a cherry-picking sequence.

First, suppose that is satisfiable. Let be a truth assignment for such that each clause is satisfied. We next describe a sequence of pruning operations. Noting that each taxon in is contained in the taxa sets of exactly two trees in (a fact that we freely use throughout the rest of this proof), it is straightforward to verify that this sequence implies a cherry-picking sequence for .

Part 1: Variable gadgets.

For each variable with do the following. If prune taxon from the two trees in whose taxa sets contain . On the other hand, if prune taxon from the two trees in whose taxa sets contain . Taken together, these pruning steps delete a single leaf of each tree in and a single leaf of half of the trees in .

Part 2: Clause gadgets.

Consider the set of trees resulting from the pruning described in Part 1. For each with , let be a subset of such that and, if is not satisfied by , then . Setting , process the three literals in from left to right in the following way.

  1. If is satisfied by , prune from the tree in whose taxa set contains and, noting that , prune from the tree in whose taxa set contains .

  2. If , prune , where if , if , and if , from the two trees in whose taxa sets contain .

  3. Prune , where if , if , and if , from the two trees in whose taxa sets contain .

Now prune from the tree in whose taxa set contains , and prune from . If , increment by one and repeat this process with the next clause. Intuitively, by definition of , the above process prunes exactly two elements in . Since each clause is satisfied by , this guarantees that we can prune each element in and, subsequently .

Part 3: Formula gadget and remaining taxa.

Consider the set of trees resulting from the pruning described in Part 2. We prune the remaining taxa as follows.

  1. In order, prune each of

    from a tree whose taxa set contains the respective blocking taxa and from . After all taxa have been pruned, each tree in is either the null tree or consists of a single vertex labeled for some .

  2. For each , prune the unique taxon with that has not been pruned in Part 2 from two trees in . Now, each tree in that is not the null tree consists of a single vertex labeled for some and .

  3. For each , note that one of labels a leaf of a cherry in a tree in while the other labels the leaf of a tree in that consists of a single vertex. In order, prune each of

    from the tree in whose taxa set contains the respective blocking taxa and from .

  4. In order, prune and from and .

  5. Consider the remaining trees in and observe that each such tree consists of exactly three leaves, two of which are blocking taxa that form a cherry. In order, prune each of

    from and the tree in whose taxa set contains the respective blocking taxon.

  6. For each , let be the unique element in that has not been pruned in Part 1. Prune from the two trees in whose taxa sets contain .

  7. For each in increasing order, consider each literal in with that is not satisfied by . By processing such literals from left to right in , prune from the two trees in whose taxa sets contain . It is easily seen that the corresponding tree in either consists of a single vertex or contains a cherry with a leaf labeled .

  8. Prune from and .

Now, relative to the elements in , we prune elements in Parts 1 and 3.6, all elements in

in Part 2, and all blocking taxa in Parts 3.1, 3.3, 3.4, 3.5, and 3.8. Additionally, in Parts 2.1 and 3.7 we prune taxa, and in Parts 2.2 and 3.2, we prune again taxa. Summing up, we prune

taxa, which is equal to the number of elements in .

Second, suppose that has a cherry-picking sequence . We write if and only if and if and only if . Further, let

We define a truth assignment as follows

In order to show that satisfies each clause of , we establish four necessary conditions that fulfills by construction.

  1. All taxa in are pruned earlier than any blocking taxon:

    (4)

    Argument: Observe that the arrangement of , , in and implies that all taxa in are pruned prior to any blocking taxon. Furthermore, we cannot prune any taxon in until we have pruned all taxa from except for , , and . We will freely use Condition 1 throughout the remainder of this proof.

  2. Let be a clause of . At least one taxon in is pruned earlier than . Stated more formally:

    (5)

    Argument: Consider the five trees in representing (see Figure 5). In order to prune , we have to prune all taxa in first. Since we can prune at most two taxa in prior to an element in , pruning all taxa in is only possible if at least one taxon in has been pruned previously.

  3. Let be any variable of . Recall the definition of the tuples and that is used in the construction of the variable gadget. If there exists a with for some , then is also pruned earlier than . Stated formally:

    (6)

    Argument: Consider a variable such that for some . Since there is no blocking taxon with , we have . Thus, is pruned from the associated caterpillar in that contains such that (see Figure 4).

    The following can be shown analogously. If there exists a for some , then is also pruned earlier than . Stated formally:

    (7)
  4. Let be any variable of . If is pruned earlier than some taxon in , then is pruned later than all taxa in , i.e.,

    (8)

    Argument: Consider a variable such that for some . Assume towards a contradiction that there is some such that . Then, one of the two blocking taxa and is pruned prior to (see Figure 4). But this is not possible since there is no blocking taxon with .

    As an immediate consequence of statement (8), we get the analogous statement for , i.e.,

    (9)

Now, we show that indeed satisfies each clause of . For each clause , we have for some (Condition 2). Since for some , we have if and if (Condition 3). Hence, by setting if and if , we satisfy at least one literal of each clause. Note that we can assign arbitrary truth values to variables with and for all . Here, we choose to set all these variables to . The truth assignment is consistent, since at least one taxon in is pruned later than all taxa in (Condition 4). Hence, the truth assignment satisfies each clause of .

Folding into two trees on the same set of taxa.

The trees in and, similarly, the trees in