In our quest to describe evolutionary histories faithfully, we are currently witnessing a shift from the representation of ancestral histories by phylogenetic (evolutionary) trees towards phylogenetic networks. The latter represent not only speciation events but also non-tree-like events such as hybridisation and horizontal gene transfer, which have played an important role throughout the evolution of certain groups of organisms, for example plants and fish [10, 17, 18, 23].
In this paper, we focus on a problem that is related to the reconstruction of phylogenetic networks. Called Minimum Hybridisation and formally stated at the end of this section, this problem was first introduced by Baroni et al. While Minimum Hybridisation was historically motivated by attempting to quantify hybridisation events, it is now more broadly regarded as a tool to quantify all non-tree-like events, to which we collectively refer as reticulation events. Pictorially speaking, Minimum Hybridisation aims at the reconstruction of a phylogenetic network that simultaneously embeds a given set of phylogenetic trees while minimising the number of reticulation events, which are represented by vertices in the network whose in-degree is at least two. More formally, the problem is based on the following underlying question. Given a collection $\mathcal{P}$ of rooted phylogenetic trees on the same set of taxa that have correctly been reconstructed for different parts of the species' genomes, what is the smallest number of reticulation events that is needed to explain $\mathcal{P}$? Over the last ten years, we have seen significant progress in characterising and computing this minimum number when $|\mathcal{P}| = 2$ (e.g. see [1, 3, 4, 8, 16, 25]). However, except for some heuristic approaches [7, 26], less is known when $|\mathcal{P}| \ge 3$. This is due to the fact that the notion of agreement forests, which underlies almost all results that are related to Minimum Hybridisation, does not appear to generalise to more than two trees.
Previously, together with Humphries, we introduced cherry-picking sequences and characterised a restricted version of Minimum Hybridisation for $\mathcal{P}$ being binary and of arbitrary size. Instead of minimising the number of reticulation events needed to explain $\mathcal{P}$ over the space of all rooted phylogenetic networks, this restricted version only considers binary temporal tree-child networks. Such networks are the binary intersection of the classes of temporal networks and tree-child networks introduced by Moret et al. and Cardona et al., respectively. Unfortunately, this restriction is so strong that a solution is not guaranteed even if $|\mathcal{P}| = 2$, i.e. there may be no such network explaining $\mathcal{P}$ [11, Figure 2].
Here, we advance our work on cherry-picking sequences and establish two new characterisations to quantify the number of reticulation events that are needed to explain a set $\mathcal{P}$ of (not necessarily binary) phylogenetic trees. The first characterisation solves the problem over the space of tree-child networks. Unlike for temporal networks, we show that every collection $\mathcal{P}$ of rooted phylogenetic trees has a solution, i.e. the trees in $\mathcal{P}$ can simultaneously be embedded into a tree-child network. Subsequently, we extend this characterisation to the space of all rooted phylogenetic networks and, hence, provide the first characterisation for Minimum Hybridisation in its most general form. Both characterisations are based on computing a cherry-picking sequence for $\mathcal{P}$, while the latter characterisation also makes use of an operation that attaches auxiliary leaves to the trees in $\mathcal{P}$.
In addition to the two new characterisations, we return to agreement forests and investigate why they seem to be of limited use for solving Minimum Hybridisation for a set of rooted phylogenetic trees of arbitrary size. Roughly speaking, given $\mathcal{P}$, one can compute a particular type of agreement forest $\mathcal{F}$ of smallest size and, if $|\mathcal{P}| = 2$, then each but one component in $\mathcal{F}$ contributes exactly one to the minimum number of reticulation events that is needed to explain $\mathcal{P}$. On the other hand, if $|\mathcal{P}| \ge 3$, the contribution of each component in $\mathcal{F}$ to this minimum number is much less clear. Motivated by this drawback of agreement forests, we consider a set $\mathcal{P}$ of rooted binary phylogenetic trees as well as the agreement forest $\mathcal{F}$ induced (formally defined in Section 5) by a phylogenetic network that explains $\mathcal{P}$ and minimises the number of reticulation events, and ask whether or not it is computationally hard to calculate the minimum number of reticulation events that is needed to explain $\mathcal{P}$. We call the associated decision problem Scoring Optimum Forest. This problem was first mentioned in , where the authors conjecture that Scoring Optimum Forest is NP-complete. Using the machinery of cherry-picking sequences, we show that Scoring Optimum Forest is NP-complete when one considers the smaller space of tree-child networks.
The paper is organised as follows. The remainder of the introduction contains some definitions and preliminaries on phylogenetic networks. In Section 2, we state the two new characterisations in terms of cherry-picking sequences. The first optimises Minimum Hybridisation within the space of tree-child networks and the second optimises Minimum Hybridisation within the space of all phylogenetic networks. The second characterisation is an extension of the first by additionally allowing the attachment of auxiliary leaves. We then establish proofs for both characterisations in Section 3 as well as a formal description of the analogous algorithm. In Section 4, we establish an upper bound on the number of auxiliary leaves that, given a collection of phylogenetic trees, are needed to characterise Minimum Hybridisation over the space of all rooted phylogenetic networks. Lastly, in Section 5, we formally state the problem Scoring Optimum Forest and show that it is NP-complete. We finish the paper with some concluding remarks in Section 6.
Throughout the paper, $X$ denotes a non-empty finite set. A phylogenetic network $\mathcal{N}$ on $X$ is a rooted acyclic digraph with no parallel edges that satisfies the following properties:
the (unique) root has out-degree two,
the set $X$ is the set of vertices of out-degree zero, each of which has in-degree one, and
all other vertices either have in-degree one and out-degree two, or in-degree at least two and out-degree one.
For technical reasons, if $|X| = 1$, we additionally allow $\mathcal{N}$ to consist of the single vertex in $X$. The set $X$ is the leaf set of $\mathcal{N}$ and the vertices in $X$ are called leaves. We sometimes denote the leaf set of $\mathcal{N}$ by $\mathcal{L}(\mathcal{N})$. For two vertices $u$ and $v$ in $\mathcal{N}$, we say that $u$ is a parent of $v$ and $v$ is a child of $u$ if $(u, v)$ is an edge in $\mathcal{N}$. Furthermore, the vertices of in-degree at most one and out-degree two are tree vertices, while the vertices of in-degree at least two and out-degree one are reticulations. An edge directed into a reticulation is called a reticulation edge while each non-reticulation edge is called a tree edge. We say that $\mathcal{N}$ is binary if each reticulation has in-degree exactly two. Lastly, a directed path $P$ in $\mathcal{N}$ ending at a leaf is a tree path if every intermediate vertex in $P$ is a tree vertex.
A phylogenetic network on is tree child if each non-leaf vertex in is the parent of a tree vertex or a leaf. An example of two tree-child networks and is given at the bottom of Figure 1. Note that the phylogenetic network obtained from by deleting the leaf labelled 4 and suppressing the resulting degree-two vertex results in a network that is not tree child.
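The tree-child condition can be tested mechanically. The following minimal sketch assumes that a network is given as a list of directed edges `(parent, child)`; the function name `is_tree_child`, the edge-list encoding, and the two small example networks are illustrative assumptions and are not taken from the paper. A child of in-degree one is a tree vertex or a leaf, so a non-leaf vertex violates the condition exactly when all of its children have in-degree at least two.

```python
from collections import defaultdict


def is_tree_child(edges):
    """Check that every non-leaf vertex has a child that is a tree vertex or a leaf.

    `edges` is a list of (parent, child) pairs (an assumed encoding).
    """
    children = defaultdict(list)
    in_degree = defaultdict(int)
    for parent, child in edges:
        children[parent].append(child)
        in_degree[child] += 1
    # Only vertices with at least one outgoing edge (non-leaves) are keys of `children`.
    return all(
        any(in_degree[c] == 1 for c in kids)
        for kids in children.values()
    )


# Hypothetical example: one reticulation h with parents a and b.
one_reticulation = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
                    ("b", "h"), ("b", "y"), ("h", "z")]

# Hypothetical example: both children of a (and of b) are reticulations.
two_reticulations = [("r", "a"), ("r", "b"), ("a", "h1"), ("a", "h2"),
                     ("b", "h1"), ("b", "h2"), ("h1", "x"), ("h2", "y")]
```

In the second example the vertex `a` has only reticulations as children, so the check fails there.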
A rooted phylogenetic $X$-tree $\mathcal{T}$ is a rooted tree with no degree-two vertices except possibly the root, which has degree at least two, and with leaf set $X$. If $|X| = 1$, then $\mathcal{T}$ consists of the single vertex in $X$. As for phylogenetic networks, the set $X$ is called the leaf set of $\mathcal{T}$ and is denoted by $\mathcal{L}(\mathcal{T})$. In addition, $\mathcal{T}$ is binary if $|X| = 1$ or, apart from the root which has degree two, all interior vertices have degree three. Since we are only interested in rooted phylogenetic trees and rooted binary phylogenetic trees in this paper, we will refer to such trees simply as phylogenetic trees and binary phylogenetic trees, respectively. For a phylogenetic $X$-tree $\mathcal{T}$, we consider two types of subtrees. Let $X'$ be a subset of $X$. The minimal subtree of $\mathcal{T}$ that connects all the leaves in $X'$ is denoted by $\mathcal{T}(X')$. Moreover, the restriction of $\mathcal{T}$ to $X'$, denoted by $\mathcal{T}|X'$, is the phylogenetic $X'$-tree obtained from $\mathcal{T}(X')$ by suppressing all degree-two vertices apart from the root. Lastly, for two phylogenetic $X$-trees $\mathcal{T}$ and $\mathcal{T}'$, we say that $\mathcal{T}'$ is a refinement of $\mathcal{T}$ if $\mathcal{T}$ can be obtained from $\mathcal{T}'$ by contracting a possibly empty set of internal edges in $\mathcal{T}'$. In addition, $\mathcal{T}'$ is a binary refinement of $\mathcal{T}$ if $\mathcal{T}'$ is binary.
Let $\mathcal{T}$ be a phylogenetic $X'$-tree. A phylogenetic network $\mathcal{N}$ on $X$ with $X' \subseteq X$ displays $\mathcal{T}$ if, up to suppressing vertices with in-degree 1 and out-degree 1, there exists a binary refinement of $\mathcal{T}$ that can be obtained from $\mathcal{N}$ by deleting edges, leaves not in $X'$, and any resulting vertices of out-degree zero, in which case we call the resulting acyclic digraph an embedding of $\mathcal{T}$ in $\mathcal{N}$. If $\mathcal{P}$ is a collection of phylogenetic $X$-trees, then $\mathcal{N}$ displays $\mathcal{P}$ if each tree in $\mathcal{P}$ is displayed by $\mathcal{N}$. For example, the two phylogenetic networks at the bottom of Figure 1 both display each of the four trees shown in the top part of the same figure.
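Let $\mathcal{N}$ be a phylogenetic network with vertex set $V$ and root $\rho$. The hybridisation number of $\mathcal{N}$, denoted $h(\mathcal{N})$, is the value
$$h(\mathcal{N}) = \sum_{v \in V - \{\rho\}} \big(d^-(v) - 1\big),$$
where $d^-(v)$ denotes the in-degree of $v$. For example, the two phylogenetic networks shown at the bottom of Figure 1 have hybridisation number 3 and 4, respectively. Observe that each tree vertex and each leaf contributes zero to this sum, but each reticulation $v$ contributes $d^-(v) - 1$. Furthermore, for a set $\mathcal{P}$ of phylogenetic $X$-trees, we denote by $h_{tc}(\mathcal{P})$ and $h(\mathcal{P})$, respectively, the values
$$\min\{h(\mathcal{N}) : \mathcal{N} \text{ is a tree-child network on } X \text{ that displays } \mathcal{P}\}$$
and
$$\min\{h(\mathcal{N}) : \mathcal{N} \text{ is a phylogenetic network on } X \text{ that displays } \mathcal{P}\}.$$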
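As a small illustration of the hybridisation number of a network, the sketch below sums the in-degree minus one over all non-root vertices, which is zero for tree vertices and leaves and positive only for reticulations, as observed above. The edge-list encoding, the function name `hybridisation_number`, and the example networks are assumptions made for this illustration only.

```python
from collections import Counter


def hybridisation_number(edges):
    """Sum of (in-degree - 1) over all vertices that have at least one parent.

    `edges` is a list of (parent, child) pairs (an assumed encoding). The root
    has no incoming edge, so it never appears in the counter and is excluded.
    """
    in_degree = Counter(child for _, child in edges)
    return sum(degree - 1 for degree in in_degree.values())


# Hypothetical network with a single reticulation h of in-degree two.
net_a = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
         ("b", "h"), ("b", "y"), ("h", "z")]

# Hypothetical network whose reticulation h has in-degree three.
net_b = [("r", "a"), ("r", "b"), ("a", "c"), ("a", "h"), ("c", "x"),
         ("c", "h"), ("b", "h"), ("b", "y"), ("h", "z")]
```

A reticulation of in-degree three counts twice, which is why the second network has a larger hybridisation number than the first although both contain a single reticulation vertex.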
Remark. While the above definition of a phylogenetic network is restricted to networks whose tree vertices have out-degree exactly two, we note that the results in this paper also hold for networks with tree vertices whose out-degree is at least two. More particularly, if a set $\mathcal{P}$ of phylogenetic $X$-trees is displayed by a phylogenetic network $\mathcal{N}$ whose tree vertices have out-degree at least two, then, by "refining" such vertices, we can obtain a phylogenetic network $\mathcal{N}'$ whose tree vertices have out-degree exactly two, that displays $\mathcal{P}$, and for which $h(\mathcal{N}') = h(\mathcal{N})$. Thus no generality is lost with this restriction.
We next formally state the two decision problems that this paper is centred around.
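Minimum Tree-Child Hybridisation
Instance. A set $\mathcal{P}$ of phylogenetic $X$-trees and a positive integer $k$.
Question. Does there exist a tree-child network $\mathcal{N}$ on $X$ that displays $\mathcal{P}$ such that $h(\mathcal{N}) \le k$?
Minimum Hybridisation
Instance. A set $\mathcal{P}$ of phylogenetic $X$-trees and a positive integer $k$.
Question. Does there exist a phylogenetic network $\mathcal{N}$ on $X$ that displays $\mathcal{P}$ such that $h(\mathcal{N}) \le k$?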
We will see at the end of this section that, for any given set $\mathcal{P}$ of phylogenetic $X$-trees, Minimum Tree-Child Hybridisation has a solution, i.e. there exists a tree-child network that displays $\mathcal{P}$.
It was shown in  that Minimum Hybridisation is NP-hard, even for when consists of two rooted binary phylogenetic -trees. To see that Minimum Tree-Child Hybridisation is also computationally hard, we again consider this restricted version of the problem and recall the following observation that was first mentioned in  and can be derived by slightly modifying the proof of [2, Theorem 2].
Let be a collection of two binary phylogenetic -trees. If there exists a phylogenetic network that displays with , then there also exists a tree-child network that displays with .
The next theorem, whose straightforward proof is omitted, follows from Observation 1 and the fact that, given a tree-child network and a binary phylogenetic tree , it can be checked in polynomial time whether or not displays [14, 21].
Theorem 1.1. The decision problem Minimum Tree-Child Hybridisation is NP-complete.
We end this section by showing that every collection of phylogenetic trees can be displayed by a tree-child network. For , let be the unique binary phylogenetic tree on two leaves, and say, whose root is a vertex at the end of a pendant edge adjoined to the original root. Now for a positive integer , obtain from by adding an edge that joins a new vertex and a new leaf and, for each tree edge in , subdividing with a new vertex and adding the edge . We call the universal network on leaves and note that is unique up to relabelling its leaves.
Let be the universal network on with . Then is tree child and displays all binary phylogenetic -trees.
By construction of from it is straightforward to check that, as is tree child, is tree child. To see that displays all binary phylogenetic -trees, we use induction on . Clearly, displays the unique binary phylogenetic tree on two leaves. For , assume that the universal network on displays all binary phylogenetic -trees. Observe that can be obtained from by deleting , the parent of and all their incident edges, and suppressing all resulting vertices with in-degree 1 and out-degree-1. Now, let be a binary phylogenetic -tree, and let be . Furthermore, let be the subset of that consists of the descendant leaves of the parent of in . As displays , there exist an embedding of in and an edge in such that the set of descendants of in is precisely . If is a tree edge in , then it is easily checked that displays by construction. On the other hand, if is a reticulation edge in , then has out-degree 1 in . Let be the unique edge in that is directed out of . Note that, as is tree child, is a tree vertex in . Then, as is a tree edge in that is subdivided by a new vertex in the construction of from , it again follows that displays . This completes the proof of the theorem.
The next corollary is an immediate consequence of Theorem 1.2 and the fact that every phylogenetic tree has a binary refinement on the same leaf set.
Corollary 1. Let $\mathcal{P}$ be a set of phylogenetic $X$-trees. There exists a tree-child network on $X$ that displays $\mathcal{P}$.
While every collection of phylogenetic $X$-trees can be displayed by a tree-child network on $X$, a simple counting argument shows that the analogous result is not true for binary tree-child networks. Specifically, a binary tree-child network on $X$ has at most $|X| - 1$ reticulations [6, Proposition 1] and so displays at most $2^{|X|-1}$ distinct binary phylogenetic $X$-trees. But for large enough $|X|$, there are many more distinct binary phylogenetic $X$-trees than $2^{|X|-1}$. For related results, we refer the interested reader to .
2 Cherry-picking characterisations
In this section, we state the two cherry-picking characterisations whose proofs are given in the next section. Let $\mathcal{T}$ be a phylogenetic $X$-tree with root $\rho$, where $|X| \ge 2$. If $x$ is a leaf of $\mathcal{T}$, we denote by $\mathcal{T} \backslash x$ the operation of deleting $x$ and its incident edge and, if the parent of $x$ in $\mathcal{T}$ has out-degree 2, suppressing the resulting degree-two vertex. Note that if the parent of $x$ is $\rho$ and $\rho$ has out-degree 2, then $\mathcal{T} \backslash x$ denotes the operation of deleting $x$ and its incident edge, and then deleting $\rho$ and its incident edge. Observe that $\mathcal{T} \backslash x$ is a phylogenetic tree on $X - \{x\}$. A 2-element subset $\{x, y\}$ of $X$ is a cherry of $\mathcal{T}$ if $x$ and $y$ have the same parent. Clearly, every phylogenetic tree with at least two leaves contains a cherry. In this paper, we typically distinguish the leaves in a cherry, in which case we write $\{x, y\}$ as the ordered pair $(x, y)$ or $(y, x)$ depending on the roles of $x$ and $y$.
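Let $\mathcal{P}$ be a set of phylogenetic $X$-trees. A sequence
$$\sigma = (x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s), (x_{s+1}, -)$$
of ordered pairs in $X \times (X \cup \{-\})$ is a cherry-picking sequence of $\mathcal{P}$ if the following algorithm returns a set of phylogenetic trees each of which consists of a single vertex in $X$.
Algorithm. Picking Cherries
Input. A set $\mathcal{P}$ of phylogenetic $X$-trees and a cherry-picking sequence $\sigma = (x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s), (x_{s+1}, -)$.
Output. A set $\mathcal{P}_s$ of phylogenetic trees.
Step 1. Set $\mathcal{P}_0 = \mathcal{P}$ and, for each tree $\mathcal{T} \in \mathcal{P}$, set $\mathcal{T}_0 = \mathcal{T}$. Set $i = 1$.
Step 2. Set $\mathcal{P}_i$ to be the set of phylogenetic trees obtained from $\mathcal{P}_{i-1}$ by performing exactly one of the following two operations for each tree $\mathcal{T}_{i-1} \in \mathcal{P}_{i-1}$:
(a) If $\{x_i, y_i\}$ is a cherry of $\mathcal{T}_{i-1}$, then set $\mathcal{T}_i = \mathcal{T}_{i-1} \backslash x_i$.
(b) Else, set $\mathcal{T}_i = \mathcal{T}_{i-1}$.
Step 3. If $i < s$, increment $i$ by one and repeat Step 2; otherwise, return $\mathcal{P}_s$.
For all $i \in \{1, 2, \ldots, s\}$, we say that $\mathcal{P}_i$ is obtained from $\mathcal{P}_{i-1}$ by picking $(x_i, y_i)$. Furthermore, if $\mathcal{P}_i \ne \mathcal{P}_{i-1}$ for each $i \in \{1, 2, \ldots, s\}$, we say that each ordered pair in $\sigma$ is essential. The weight of $\sigma$, denoted $w(\sigma)$, is the value $s + 1 - |X|$. Observe that if $\sigma$ is a cherry-picking sequence of $\mathcal{P}$, then
$$w(\sigma) \ge 0$$
as each element in $X$ must appear as the first element in an ordered pair in $\sigma$.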
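The Picking Cherries procedure can be sketched in a few lines. The sketch below rests on several assumptions that are not part of the paper: a tree is encoded as a nested tuple whose leaves are strings, the final pair of a sequence is written `(x, None)`, and the weight is taken to be the number of ordered pairs minus the number of leaves. All function names are hypothetical.

```python
def has_cherry(tree, x, y):
    """Return True if the leaves x and y have the same parent in `tree`."""
    if isinstance(tree, str):
        return False
    if x in tree and y in tree:
        return True
    return any(has_cherry(child, x, y) for child in tree)


def delete_leaf(tree, x):
    """Delete leaf x and suppress a parent that is left with a single child."""
    if isinstance(tree, str):
        return tree
    kept = tuple(delete_leaf(child, x) for child in tree if child != x)
    return kept[0] if len(kept) == 1 else kept


def pick_cherries(trees, sigma):
    """For each pair (x, y) in turn, delete x from every tree in which {x, y} is a cherry."""
    current = list(trees)
    for x, y in sigma:
        current = [delete_leaf(t, x) if has_cherry(t, x, y) else t
                   for t in current]
    return current


def is_cherry_picking_sequence(trees, sigma):
    """True if every tree has been reduced to a single leaf."""
    return all(isinstance(t, str) for t in pick_cherries(trees, sigma))


def weight(sigma, leaves):
    """Number of ordered pairs minus number of leaves (assumed definition)."""
    return len(sigma) - len(leaves)


# Hypothetical example: two different binary trees on the leaves a, b, c.
trees = [(("a", "b"), "c"), (("a", "c"), "b")]
sigma = [("a", "b"), ("a", "c"), ("b", "c"), ("c", None)]
```

In this example the leaf `a` is picked twice, once for each tree, which is what makes the weight positive. A pair whose second coordinate is `None` never matches a cherry, so the final pair leaves every tree unchanged.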
A particular type of cherry-picking sequence underlies our characterisation of . To this end, let be a set of phylogenetic -trees. A cherry-picking sequence
for is called a tree-child sequence if and, for all , we have . Now, let be a tree-child sequence for . We call a minimum-tree-child sequence of if is of smallest value over all tree-child sequences of . This smallest value is denoted by . It will follow from the results in the next section (Lemma 3) that every collection of phylogenetic trees has a tree-child sequence and so is well defined.
Referring to Figure 1,
is a tree-child sequence with weight for the four trees shown at the top of this figure.
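One reading of the ordering condition in the definition of a tree-child sequence, consistent with how the condition is used in the proofs of the next section, is that an element never occurs as a second coordinate after it has occurred as a first coordinate. The sketch below checks only this reading, which is an assumption; the function name `respects_tree_child_condition`, the list-of-pairs encoding, and the use of `None` as the second coordinate of the final pair are likewise illustrative.

```python
def respects_tree_child_condition(sigma):
    """Check that no second coordinate has already been used as a first coordinate.

    `sigma` is a list of (x, y) pairs; the final pair may be (x, None).
    This encodes an assumed reading of the tree-child condition.
    """
    seen_first = set()
    for x, y in sigma:
        if y in seen_first:
            return False
        seen_first.add(x)
    return True


# Hypothetical sequences on the leaves a, b, c.
good = [("a", "b"), ("a", "c"), ("b", "c"), ("c", None)]
bad = [("a", "b"), ("b", "a"), ("c", None)]
```

In the second sequence the element `a` is deleted in the first step and then reused as the second coordinate of the following pair, which is exactly what this reading forbids.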
Remark. As noted in the introduction, cherry-picking sequences were introduced in . In the set-up of this paper, the difference is as follows. Instead of a cherry-picking sequence consisting of a set of ordered pairs, a cherry-picking sequence in  consists of an ordering of the elements in . Moreover, this ordering has the additional property that, in the step analogous to Step 2 of Picking Cherries, is part of a cherry of every tree in . At this step, is deleted from each tree in , and the iterative process continues. The weighting of such a sequence is based, across all , on the number of different cherries of which is part of. It is not difficult to see how this could be interpreted as a special type of tree-child sequence.
The first of our new characterisations is the next theorem. For a given set of phylogenetic -trees, it writes in terms of tree-child sequences for .
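Theorem 2.1. Let $\mathcal{P}$ be a set of phylogenetic $X$-trees. Then the smallest hybridisation number of a tree-child network on $X$ that displays $\mathcal{P}$ equals the weight of a minimum-tree-child sequence of $\mathcal{P}$.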
To state the second characterisation, we require an additional concept. Let $\mathcal{T}$ be a phylogenetic $X$-tree. Consider the operation of adjoining a new leaf $z$ to $\mathcal{T}$ in one of the following three ways.
Subdivide an edge of $\mathcal{T}$ with a new vertex, $u$ say, and add the edge $(u, z)$.
View the root $\rho$ of $\mathcal{T}$ as a degree-one vertex adjacent to the original root and add the edge $(\rho, z)$.
Add the edge $(u, z)$, where $u$ is an interior vertex of $\mathcal{T}$.
We refer to this operation as attaching a new leaf $z$ to $\mathcal{T}$. More generally, if $Z$ is a finite set of elements such that $X \cap Z$ is empty, then attaching $Z$ to $\mathcal{T}$ is the operation of attaching, in turn, each element in $Z$ to $\mathcal{T}$ to eventually obtain a phylogenetic tree on $X \cup Z$. We refer to $Z$ as a set of auxiliary leaves. Lastly, attaching $Z$ to a set $\mathcal{P}$ of phylogenetic $X$-trees is the operation of attaching $Z$ to each tree in $\mathcal{P}$.
Let be a set of phylogenetic -trees. A tree-child sequence for is leaf added if it is a tree-child sequence of a set of phylogenetic trees obtained from by attaching a set of auxiliary leaves. We denote the minimum weight amongst all leaf-added tree-child sequences of by . Of course, , but this inequality can also be strict. To illustrate, consider the two sets and of phylogenetic trees shown in Figure 2. Now
is a tree-child sequence for of weight . Since can be obtained by attaching to , it follows that is a leaf-added tree-child sequence for and .
For a given set of phylogenetic -trees, the next theorem characterises in terms of leaf-added tree-child sequences.
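Theorem 2.2. Let $\mathcal{P}$ be a set of phylogenetic $X$-trees. Then the smallest hybridisation number of a phylogenetic network on $X$ that displays $\mathcal{P}$ equals the minimum weight amongst all leaf-added tree-child sequences of $\mathcal{P}$.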
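Lemma 1. Let $\mathcal{P}$ be a set of phylogenetic $X$-trees. Let $\sigma$ be a tree-child sequence for $\mathcal{P}$. Then there exists a tree-child network $\mathcal{N}$ on $X$ that displays $\mathcal{P}$, whose hybridisation number is at most the weight of $\sigma$, and that satisfies the following properties (i) and (ii):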
If is a tree vertex in and not the parent of a reticulation, then there are leaves and at the end of tree paths starting at the children and of , respectively, such that is an element in .
If is a tree vertex in and the parent of a reticulation , then there are leaves and at the end of tree paths starting at and , respectively, such that .
be a tree-child sequence for . The proof is by induction on . If , then and each tree in consists of the single vertex in . It immediately follows that choosing to be the phylogenetic network consisting of the single vertex in establishes the lemma for .
Now suppose that , and that the lemma holds for all tree-child sequences for sets of phylogenetic trees on the same leaf set whose length is at most . Let
and let be the set of phylogenetic trees obtained from by picking .
First assume each tree in has the same leaf set, namely . Then is a tree-child sequence for . By induction, there is a tree-child network on that displays with and satisfies (i) and (ii). Since each tree in has the same leaf set, is a cherry in each tree in . Therefore, as displays a binary refinement of each tree in , the tree-child network obtained from by subdividing the edge directed into with a new vertex and adding the edge displays . Furthermore, as and satisfies (i) and (ii) relative to , we have and it is easily seen that satisfies (i) and (ii) relative to .
Now assume that not every tree in has the same leaf set. Let denote the subset of trees in whose leaf set is . Since is non-empty, there exists an ordered pair in whose first coordinate is . Note that is not in ; otherwise there is an ordered pair in whose second coordinate is and so is not a tree-child sequence for . Let be the first such ordered pair. Let be a tree in and, using , consider applying iterations of Picking Cherries to . Let denote the subset of leaves in that are deleted from in this process. Observe that, as is the second coordinate in , we have .
We next add to to obtain a phylogenetic -tree for which is a tree-child sequence. Let be the (unique) vertex of that is closest to the root with the property that is a descendant leaf of , and the child of on the path from to has all its descendant leaves in . Let be the binary phylogenetic -tree obtained from by adding the edge . We now show that is a tree-child sequence for . Suppose that is not a tree-child sequence for . Let be the parent of in . Then amongst the first ordered pairs in is an ordered pair of the form that is essential when, using , Picking Cherries is applied to , where is a descendant leaf of in . But then, each descendant leaf of is in , contradicting the choice of .
Repeating this placement of for each tree in , we obtain a set of phylogenetic -trees from . Let and observe that is a tree-child sequence for . Therefore, by induction, there is a tree-child network on that displays with and satisfies (i) and (ii).
Let denote the parent of in . If is a reticulation, let be the phylogenetic network obtained from by subdividing the edge directed into with a new vertex and adding the edge . Since is tree child and displays , it follows that is tree child and displays . Furthermore,
Additionally, as , it also follows that, as satisfies (i) and (ii) relative to , we have satisfies (i) and (ii) relative to .
Thus we may assume that is a tree vertex. Let denote the child of that is not in . If is a reticulation, then, as satisfies (ii), contains a cherry in which is the second coordinate. But is the first ordered pair in and so, as is tree child, is never the second coordinate in an ordered pair in ; a contradiction. Therefore is either a tree vertex or a leaf in . So, as satisfies (i) and no ordered pair has as the second coordinate, it follows that contains an ordered pair, say, where is the leaf at the end of a tree path in starting at . Now let be the phylogenetic network obtained from by subdividing the edges directed into and with new vertices and , respectively, and adding the edge . Since is tree child and , it is easily seen that is tree child and . Furthermore, displays as well as , and therefore . Thus displays . To see that satisfies (i) and (ii) relative to , it suffices to show that satisfies (ii) for and . Indeed, the two ordered pairs and in verify (ii) for and , respectively. This completes the proof of the lemma.
The next corollary immediately follows from Lemma 1.
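Corollary 2. Let $\mathcal{P}$ be a set of phylogenetic $X$-trees. Then the smallest hybridisation number of a tree-child network on $X$ that displays $\mathcal{P}$ is at most the weight of a minimum-tree-child sequence of $\mathcal{P}$.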
For the proof of the converse of Corollary 2, we begin with an additional lemma. Let $\mathcal{N}$ be a phylogenetic network, and let $a$ and $b$ be two leaves in $\mathcal{N}$. Generalising cherries to phylogenetic networks, we say that $\{a, b\}$ is a cherry in $\mathcal{N}$ if $a$ and $b$ have a common parent. Moreover, we call $\{a, b\}$ a reticulated cherry if the parent of $a$, say $p_a$, and the parent of $b$, say $p_b$, are joined by a reticulation edge $(p_b, p_a)$, in which case we say that $a$ is the reticulation leaf relative to $\{a, b\}$. We next define two operations on $\mathcal{N}$. First, reducing a cherry $\{a, b\}$ is the operation of deleting one of the two leaves in $\{a, b\}$, and suppressing the resulting degree-2 vertex. Second, reducing a reticulated cherry $\{a, b\}$ is the operation of deleting the reticulation edge joining the parents of $a$ and $b$ and suppressing any resulting degree-2 vertices. The proof of the next lemma is similar to the analogous result for binary tree-child networks [5, Lemma 4.1] and is omitted.
Lemma 2. Let $\mathcal{N}$ be a tree-child network on $X$. Then the following hold.
(i) If $|X| \ge 2$, then $\mathcal{N}$ contains either a cherry or a reticulated cherry.
(ii) If $\mathcal{N}'$ is obtained from $\mathcal{N}$ by reducing either a cherry or a reticulated cherry, then $\mathcal{N}'$ is a tree-child network.
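To make the notions of cherry and reticulated cherry in a network concrete, the following sketch lists both kinds of pairs for a network given as an edge list. The encoding, the function name `find_cherries`, and the example network are assumptions made for illustration. Each leaf has in-degree one, so its parent is unique, and a pair of leaves forms a reticulated cherry when the edge joining their parents is directed into a vertex of in-degree at least two.

```python
from itertools import combinations


def find_cherries(edges):
    """Return (cherries, reticulated), where reticulated holds (pair, reticulation_leaf).

    `edges` is a list of (parent, child) pairs (an assumed encoding).
    """
    children, parents = {}, {}
    for u, v in edges:
        children.setdefault(u, []).append(v)
        parents.setdefault(v, []).append(u)
    edge_set = set(edges)
    # Leaves are the vertices with a parent and no children.
    leaves = sorted(v for v in parents if v not in children)
    cherries, reticulated = [], []
    for a, b in combinations(leaves, 2):
        pa, pb = parents[a][0], parents[b][0]
        if pa == pb:
            cherries.append((a, b))
        elif (pa, pb) in edge_set and len(parents[pb]) >= 2:
            reticulated.append(((a, b), b))  # b hangs below the reticulation
        elif (pb, pa) in edge_set and len(parents[pa]) >= 2:
            reticulated.append(((a, b), a))  # a hangs below the reticulation
    return cherries, reticulated


# Hypothetical tree-child network with one reticulation h above the leaf z.
network = [("r", "a"), ("r", "b"), ("a", "x"), ("a", "h"),
           ("b", "h"), ("b", "y"), ("h", "z")]
```

The example network has no cherry but two reticulated cherries, both with `z` as the reticulation leaf, in line with part (i) of the lemma.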
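Lemma 3. Let $\mathcal{P}$ be a set of phylogenetic $X$-trees. Then the weight of a minimum-tree-child sequence of $\mathcal{P}$ is at most the smallest hybridisation number of a tree-child network on $X$ that displays $\mathcal{P}$.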
Let be a tree-child network on that displays . By Corollary 1, such a network exists. We establish the lemma by explicitly constructing a tree-child sequence for such that .
Let denote the root of , and let denote the reticulations of . Let denote the leaves at the end of tree paths in starting at , respectively. Observe that these paths are pairwise vertex disjoint. We now construct a sequence of ordered pairs as follows:
Set and to be the empty sequence. Set .
If consists of a single vertex , then set to be the concatenation of and , and return .
If is a cherry in , then
If one of and , say , equates to for some and is not a reticulation in , then set to be the concatenation of and .
Otherwise, set to be the concatenation of and , where .
Set to be the tree-child network obtained from by deleting , thereby reducing the cherry .
Increase by one and go to Step 2.
Else, there is a reticulated cherry in , where say is the reticulation leaf.
Set to be the concatenation of and .
Set to be the tree-child network obtained from by reducing the reticulated cherry .
Increase by one and go to Step 2.
First note that it is easily checked that the construction is well defined, that is, it returns a sequence of ordered pairs. Moreover, in each iteration of the above construction, it follows from Lemma 2 that is tree child. We next show that, if is a cherry in , and and equate to and , respectively, where and are elements in , then exactly one of and is a reticulation in . To see this, if and are both reticulations in , then and are not vertex disjoint in ; a contradiction. On the other hand, suppose neither and are reticulations in . Without loss of generality, we may assume is the first such cherry for which this holds. Since is tree child, and therefore has no tree vertex that is the parent of two reticulations, there is an iteration , in which the cherry is concatenated with , where , and has as a cherry but does not. If or , we contradict the construction by the choice of . Also, if , then we again contradict the construction. Hence, we may assume for the remainder of the proof that exactly one of and is a reticulation in .
be the sequence returned by the construction. We prove by induction on that is a tree-child sequence for whose weight is at most . If , then consists of the single vertex in and the construction correctly returns such a sequence.
Now suppose that , and consider the first iteration of the construction. Either is a cherry or a reticulated cherry of . If is a cherry, then is a cherry of each tree in . In this instance, let denote the set of phylogenetic -trees obtained from by picking , where . Observe that is a tree-child network on that displays .
Now assume that is a reticulated cherry with as the reticulation leaf. Let be the subset of trees in not displayed by and let . Note that is a cherry of each tree in . For each tree in , delete the edge incident with , suppress any resulting degree-2 vertex, and reattach to the rest of the tree containing by subdividing an edge with a new vertex and adding an edge joining this vertex and so that the resulting phylogenetic -tree is displayed by . It is easily seen that this is always possible. Let denote the resulting collection of trees obtained from . For this instance, let and observe that displays . To complete the induction it suffices to show that if
is a tree-child sequence for whose weight is at most , then is a tree-child sequence for whose weight is at most , that is, at most .
First assume that is a cherry of . Then, as is a cherry of each tree in , it follows by induction that is a cherry-picking sequence of . Furthermore, as is tree child and , is also tree child. Since only appears once as the first coordinate of an ordered pair in , we have
Now assume that is a reticulated cherry of with as the reticulation leaf. Without loss of generality, let denote the associated reticulation, so that . Since each tree in has as a cherry and is a tree-child sequence for , it follows that is a cherry-picking sequence for .
We next show that is tree child. If the in-degree of is at least three in , then exists in and so, by construction, does not appear as the second coordinate of an ordered pair in as well as in . Therefore, if the in-degree of is at least three in , then is a tree child.
Now suppose that the in-degree of is two in . To establish that is tree child, assume to the contrary that appears as the second coordinate of an ordered pair in . Let denote the first such ordered pair. Then, at some iteration , either is a cherry or a reticulated cherry of . If is a cherry of , then, since , we are in Step 3(a) in iteration of the construction and so the ordered pair should be ; a contradiction. On the other hand, if is a reticulated cherry of , then is the reticulation leaf of and, by construction of , one of the parents of in is the parent of two reticulations in , namely and the reticulation for which, by construction, there is a tree path starting at this reticulation and ending at ; a contradiction as is tree child. Hence is tree child. Furthermore, as and ,