1 Introduction
In our quest to describe evolutionary histories faithfully, we are currently witnessing a shift from representing ancestral histories by phylogenetic (evolutionary) trees towards representing them by phylogenetic networks. The latter represent not only speciation events but also non-tree-like events such as hybridisation and horizontal gene transfer, which have played an important role throughout the evolution of certain groups of organisms, for example plants and fish [10, 17, 18, 23].
In this paper, we focus on a problem that is related to the reconstruction of phylogenetic networks. Called Minimum Hybridisation and formally stated at the end of this section, this problem was first introduced by Baroni et al. [2]. While Minimum Hybridisation was historically motivated by attempting to quantify hybridisation events, it is now more broadly regarded as a tool to quantify all non-tree-like events, to which we collectively refer as reticulation events. Pictorially speaking, Minimum Hybridisation aims at the reconstruction of a phylogenetic network that simultaneously embeds a given set of phylogenetic trees while minimising the number of reticulation events, which are represented by vertices in the network whose indegree is at least two. More formally, the problem is based on the following underlying question. Given a collection $\mathcal{P}$ of rooted phylogenetic trees on the same set of taxa that have correctly been reconstructed for different parts of the species' genomes, what is the smallest number of reticulation events that is needed to explain $\mathcal{P}$? Over the last ten years, we have seen significant progress in characterising and computing this minimum number for when $|\mathcal{P}| = 2$ (e.g. see [1, 3, 4, 8, 16, 25]). However, except for some heuristic approaches [7, 26], less is known for when $|\mathcal{P}| \geq 3$. This is due to the fact that the notion of agreement forests, which underlies almost all results that are related to Minimum Hybridisation, appears to be ungeneralisable to more than two trees.
Previously, together with Humphries, we introduced cherry-picking sequences and characterised a restricted version of Minimum Hybridisation for $\mathcal{P}$ being binary and of arbitrary size [12]. Instead of minimising the number of reticulation events needed to explain $\mathcal{P}$ over the space of all rooted phylogenetic networks, this restricted version only considers binary temporal tree-child networks. Such networks are the binary intersection of the classes of temporal networks and tree-child networks introduced by Moret et al. [19] and Cardona et al. [6], respectively. Disadvantageously, this restriction is so strong that not even if $|\mathcal{P}| = 2$ are we guaranteed to have a solution, i.e. there may be no such network explaining $\mathcal{P}$ [11, Figure 2].
Here, we advance our work on cherry-picking sequences and establish two new characterisations to quantify the number of reticulation events that are needed to explain a set $\mathcal{P}$ of (not necessarily binary) phylogenetic trees. The first characterisation solves the problem over the space of tree-child networks. Unlike for temporal networks, we show that every collection $\mathcal{P}$ of rooted phylogenetic trees has a solution, i.e. the trees in $\mathcal{P}$ can simultaneously be embedded into a tree-child network. Subsequently, we extend this characterisation to the space of all rooted phylogenetic networks and, hence, provide the first characterisation for Minimum Hybridisation in its most general form. Both characterisations are based on computing a cherry-picking sequence for $\mathcal{P}$, while the latter characterisation also makes use of an operation that attaches auxiliary leaves to the trees in $\mathcal{P}$.
In addition to the two new characterisations, we return to agreement forests and investigate why they seem to be of limited use to solve Minimum Hybridisation for a set of rooted phylogenetic trees of arbitrary size. Roughly speaking, given $\mathcal{P}$, one can compute a particular type of agreement forest $\mathcal{F}$ of smallest size and, if $|\mathcal{P}| = 2$, then each but one component in $\mathcal{F}$ contributes exactly one to the minimum number of reticulation events that is needed to explain $\mathcal{P}$. On the other hand, if $|\mathcal{P}| > 2$, the contribution of each component in $\mathcal{F}$ to this minimum number is much less clear. Motivated by this drawback of agreement forests, we consider a set $\mathcal{P}$ of rooted binary phylogenetic trees as well as the agreement forest $\mathcal{F}$ induced (formally defined in Section 5) by a phylogenetic network that explains $\mathcal{P}$ and minimises the number of reticulation events, and ask whether or not it is computationally hard to calculate the minimum number of reticulation events that is needed to explain $\mathcal{P}$. We call the associated decision problem Scoring Optimum Forest. This problem was first mentioned in [13], where the authors conjecture that Scoring Optimum Forest is NP-complete. Using the machinery of cherry-picking sequences, we show that Scoring Optimum Forest is NP-complete for when one considers the smaller space of tree-child networks.
The paper is organised as follows. The remainder of the introduction contains some definitions and preliminaries on phylogenetic networks. In Section 2, we state the two new characterisations in terms of cherrypicking sequences. The first optimises Minimum Hybridisation within the space of treechild networks and the second optimises Minimum Hybridisation within the space of all phylogenetic networks. The second characterisation is an extension of the first by additionally allowing the attachment of auxiliary leaves. We then establish proofs for both characterisations in Section 3 as well as a formal description of the analogous algorithm. In Section 4, we establish an upper bound on the number of auxiliary leaves that, given a collection of phylogenetic trees, are needed to characterise Minimum Hybridisation over the space of all rooted phylogenetic networks. Lastly, in Section 5, we formally state the problem Scoring Optimum Forest and show that it is NPcomplete. We finish the paper with some concluding remarks in Section 6.
Throughout the paper, $X$ denotes a nonempty finite set. A phylogenetic network $\mathcal{N}$ on $X$ is a rooted acyclic digraph with no parallel edges that satisfies the following properties:
(i) the (unique) root has outdegree two,
(ii) the set $X$ is the set of vertices of outdegree zero, each of which has indegree one, and
(iii) all other vertices either have indegree one and outdegree two, or indegree at least two and outdegree one.
For technical reasons, if , we additionally allow to consist of the single vertex in . The set is the leaf set of and the vertices in are called leaves. We sometimes denote the leaf set of by . For two vertices and in , we say that is a parent of and is a child of if is an edge in . Furthermore, the vertices of indegree at most one and outdegree two are tree vertices, while the vertices of indegree at least two and outdegree one are reticulations. An edge directed into a reticulation is called a reticulation edge while each nonreticulation edge is called a tree edge. We say that is binary if each reticulation has indegree exactly two. Lastly, a directed path in ending at a leaf is a tree path if every intermediate vertex in is a tree vertex.
A phylogenetic network on is tree child if each nonleaf vertex in is the parent of a tree vertex or a leaf. An example of two treechild networks and is given at the bottom of Figure 1. Note that the phylogenetic network obtained from by deleting the leaf labelled 4 and suppressing the resulting degreetwo vertex results in a network that is not tree child.
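As an illustration of the tree-child condition, the following minimal Python sketch checks it on a network given as a list of directed edges. The edge-list encoding, the function name `is_tree_child`, and the two small example networks are assumptions made for this sketch; they are not taken from Figure 1.

```python
from collections import defaultdict

def is_tree_child(edges):
    """Check the tree-child condition on a network given as (parent, child) pairs."""
    children = defaultdict(list)
    indegree = defaultdict(int)
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    # Only non-leaf vertices occur as keys of `children`. Such a vertex is fine
    # if at least one child has indegree one, i.e. is a tree vertex or a leaf.
    return all(
        any(indegree[c] == 1 for c in kids) for kids in children.values()
    )

# Hypothetical example: a single reticulation r whose child is the leaf b.
good = [("rho", "u"), ("rho", "v"), ("u", "a"), ("u", "r"),
        ("v", "r"), ("v", "c"), ("r", "b")]
# Hypothetical example: both children of u (and of v) are reticulations.
bad = [("rho", "u"), ("rho", "v"), ("u", "r1"), ("u", "r2"),
       ("v", "r1"), ("v", "r2"), ("r1", "a"), ("r2", "b")]
print(is_tree_child(good), is_tree_child(bad))
```

The check runs in time linear in the number of edges, since each vertex only inspects the indegrees of its own children.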
A rooted phylogenetic $X$-tree $\mathcal{T}$ is a rooted tree with no degree-two vertices except possibly the root, which has degree at least two, and with leaf set $X$. If $|X| = 1$, then $\mathcal{T}$ consists of the single vertex in $X$. As for phylogenetic networks, the set $X$ is called the leaf set of $\mathcal{T}$ and is denoted by $\mathcal{L}(\mathcal{T})$. In addition, $\mathcal{T}$ is binary if $|X| = 1$ or, apart from the root which has degree two, all interior vertices have degree three. Since we are only interested in rooted phylogenetic trees and rooted binary phylogenetic trees in this paper, we will refer to such trees simply as phylogenetic trees and binary phylogenetic trees, respectively. For a phylogenetic $X$-tree $\mathcal{T}$, we consider two types of subtrees. Let $X'$ be a subset of $X$. The minimal subtree of $\mathcal{T}$ that connects all the leaves in $X'$ is denoted by $\mathcal{T}(X')$. Moreover, the restriction of $\mathcal{T}$ to $X'$, denoted by $\mathcal{T}|X'$, is the phylogenetic tree obtained from $\mathcal{T}(X')$ by suppressing all degree-two vertices apart from the root. Lastly, for two phylogenetic $X$-trees $\mathcal{T}$ and $\mathcal{T}'$, we say that $\mathcal{T}'$ is a refinement of $\mathcal{T}$ if $\mathcal{T}$ can be obtained from $\mathcal{T}'$ by contracting a possibly empty set of internal edges in $\mathcal{T}'$. In addition, $\mathcal{T}'$ is a binary refinement of $\mathcal{T}$ if $\mathcal{T}'$ is binary.
Let be a phylogenetic tree. A phylogenetic network on with displays if, up to suppressing vertices with indegree 1 and outdegree 1, there exists a binary refinement of that can be obtained from by deleting edges, leaves not in , and any resulting vertices of outdegree zero, in which case we call the resulting acyclic digraph an embedding of in . If is a collection of phylogenetic trees, then displays if each tree in is displayed by . For example, the two phylogenetic networks at the bottom of Figure 1 both display each of the four trees shown in the top part of the same figure.
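Let $\mathcal{N}$ be a phylogenetic network with vertex set $V$ and root $\rho$. The hybridisation number of $\mathcal{N}$, denoted $h(\mathcal{N})$, is the value
$$h(\mathcal{N}) = \sum_{v \in V - \{\rho\}} \bigl(d^-(v) - 1\bigr),$$
where $d^-(v)$ denotes the indegree of $v$. For example, the phylogenetic networks $\mathcal{N}$ and $\mathcal{N}'$ that are shown in Figure 1 have hybridisation number 3 and 4, respectively. Observe that each tree vertex and each leaf contributes zero to this sum, but each reticulation $v$ contributes $d^-(v) - 1$. Furthermore, for a set $\mathcal{P}$ of phylogenetic trees, we denote by $h_{tc}(\mathcal{P})$ and $h(\mathcal{P})$, respectively, the values
$$\min\{h(\mathcal{N}) : \mathcal{N} \text{ is a tree-child network that displays } \mathcal{P}\}$$
and
$$\min\{h(\mathcal{N}) : \mathcal{N} \text{ is a phylogenetic network that displays } \mathcal{P}\}.$$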
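The hybridisation number of a single network is a sum over indegrees, which the following Python sketch makes concrete. The edge-list encoding, the function name `hybridisation_number`, and the example networks are assumptions of this sketch.

```python
from collections import Counter

def hybridisation_number(edges, root):
    """Sum of (indegree - 1) over all non-root vertices of the network."""
    indegree = Counter(child for _, child in edges)
    # Tree vertices and leaves have indegree one and contribute zero;
    # a reticulation of indegree d contributes d - 1.
    return sum(d - 1 for v, d in indegree.items() if v != root)

# Hypothetical network with one reticulation r of indegree two.
one_ret = [("rho", "u"), ("rho", "v"), ("u", "a"), ("u", "r"),
           ("v", "r"), ("v", "c"), ("r", "b")]
# Hypothetical network with one reticulation of indegree three.
three_parents = [("rho", "u"), ("u", "v"), ("u", "a"), ("v", "w"), ("v", "b"),
                 ("w", "c"), ("rho", "r"), ("v", "r"), ("w", "r"), ("r", "d")]
print(hybridisation_number(one_ret, "rho"))
print(hybridisation_number(three_parents, "rho"))
```

Note that a reticulation with three parents counts twice, so the hybridisation number can exceed the number of reticulations in a non-binary network.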
Remark. While the above definition of a phylogenetic network is restricted to networks whose tree vertices have outdegree exactly two, we note that the results in this paper also hold for networks with tree vertices whose outdegree is at least two. More particularly, if a set of phylogenetic trees is displayed by a phylogenetic network whose tree vertices have outdegree at least two, then, by “refining” such vertices, we can obtain a phylogenetic network whose tree vertices have outdegree exactly two, displays , and . Thus no generality is lost with this restriction.
We next formally state the two decision problems that this paper is centred around.
Minimum TreeChild Hybridisation
Instance. A set of phylogenetic trees and a positive integer .
Question. Does there exist a treechild network on that displays such that ?
Minimum Hybridisation
Instance. A set of phylogenetic trees and a positive integer .
Question. Does there exist a phylogenetic network on that displays such that ?
We will see at the end of this section that, for any given set of phylogenetic trees, Minimum TreeChild Hybridisation has a solution, i.e. there exists a treechild network that displays .
It was shown in [3] that Minimum Hybridisation is NPhard, even for when consists of two rooted binary phylogenetic trees. To see that Minimum TreeChild Hybridisation is also computationally hard, we again consider this restricted version of the problem and recall the following observation that was first mentioned in [12] and can be derived by slightly modifying the proof of [2, Theorem 2].
Observation 1
Let be a collection of two binary phylogenetic trees. If there exists a phylogenetic network that displays with , then there also exists a treechild network that displays with .
The next theorem, whose straightforward proof is omitted, follows from Observation 1 and the fact that, given a treechild network and a binary phylogenetic tree , it can be checked in polynomial time whether or not displays [14, 21].
Theorem 1.1
The decision problem Minimum Tree-Child Hybridisation is NP-complete.
We end this section by showing that every collection of phylogenetic trees can be displayed by a treechild network. For , let be the unique binary phylogenetic tree on two leaves, and say, whose root is a vertex at the end of a pendant edge adjoined to the original root. Now for a positive integer , obtain from by adding an edge that joins a new vertex and a new leaf and, for each tree edge in , subdividing with a new vertex and adding the edge . We call the universal network on leaves and note that is unique up to relabelling its leaves.
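The recursive construction of the universal network can be sketched in Python as follows. The vertex labels (`rho`, `r`, `l1`, `l2`, `p3`, and so on), the edge-list encoding, and the function name `universal_network` are hypothetical choices for this sketch. Following the description above, the root sits at the end of a pendant edge and therefore has outdegree one here.

```python
from collections import Counter

def universal_network(n):
    """Edge list of the universal network on n >= 2 leaves l1, ..., ln."""
    # Base case: two leaves below r, with a pendant edge from the root rho.
    edges = [("rho", "r"), ("r", "l1"), ("r", "l2")]
    for k in range(3, n + 1):
        indegree = Counter(child for _, child in edges)
        p = f"p{k}"                       # new vertex, parent of the new leaf
        new_edges = []
        for idx, (u, v) in enumerate(edges):
            if indegree[v] >= 2:          # reticulation edge: left unchanged
                new_edges.append((u, v))
            else:                         # tree edge: subdivide, link to p
                s = f"s{k}_{idx}"
                new_edges += [(u, s), (s, v), (s, p)]
        new_edges.append((p, f"l{k}"))
        edges = new_edges
    return edges

u3 = universal_network(3)
u4 = universal_network(4)
print(len(u3), Counter(c for _, c in u3)["p3"])
print(len(u4), Counter(c for _, c in u4)["p4"])
```

The size of the network grows quickly with the number of leaves, since every tree edge of the previous network becomes a parent edge of the new reticulation.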
Theorem 1.2
Let be the universal network on with . Then is tree child and displays all binary phylogenetic trees.
Proof
By construction of from it is straightforward to check that, as is tree child, is tree child. To see that displays all binary phylogenetic trees, we use induction on . Clearly, displays the unique binary phylogenetic tree on two leaves. For , assume that the universal network on displays all binary phylogenetic trees. Observe that can be obtained from by deleting , the parent of and all their incident edges, and suppressing all resulting vertices with indegree 1 and outdegree 1. Now, let be a binary phylogenetic tree, and let be . Furthermore, let be the subset of that consists of the descendant leaves of the parent of in . As displays , there exist an embedding of in and an edge in such that the set of descendants of in is precisely . If is a tree edge in , then it is easily checked that displays by construction. On the other hand, if is a reticulation edge in , then has outdegree 1 in . Let be the unique edge in that is directed out of . Note that, as is tree child, is a tree vertex in . Then, as is a tree edge in that is subdivided by a new vertex in the construction of from , it again follows that displays . This completes the proof of the theorem.
The next corollary is an immediate consequence of Theorem 1.2 and the fact that every phylogenetic tree has a binary refinement on the same leaf set.
Corollary 1
Let be a set of phylogenetic trees. There exists a treechild network on that displays .
While every collection of phylogenetic trees can be displayed by a treechild network on , a simple counting argument shows that the analogous result is not true for binary treechild networks. Specifically, a binary treechild network on has at most reticulations [6, Proposition 1] and so displays at most distinct binary phylogenetic trees. But for large enough , there are many more distinct binary phylogenetic trees than . For related results, we refer the interested reader to [22].
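The counting argument can be checked numerically with a short Python sketch. It uses the standard count $(2n-3)!!$ of rooted binary phylogenetic trees on $n$ leaves and, as an assumption of this sketch, takes the bound on the number of reticulations of a binary tree-child network on $n$ leaves to be $n - 1$, so that such a network displays at most $2^{n-1}$ trees. The function name is hypothetical.

```python
def num_rooted_binary_trees(n):
    """Number of rooted binary phylogenetic trees on n leaves: (2n - 3)!!."""
    count = 1
    for k in range(3, 2 * n - 2, 2):   # odd factors 3, 5, ..., 2n - 3
        count *= k
    return count

# Compare with the assumed upper bound 2^(n-1) on displayed trees.
for n in range(2, 8):
    print(n, num_rooted_binary_trees(n), 2 ** (n - 1))
```

Under these assumptions the number of trees first exceeds the bound at four leaves and grows super-exponentially thereafter.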
2 Cherry-picking characterisations
In this section, we state the two cherry-picking characterisations whose proofs are given in the next section. Let $\mathcal{T}$ be a phylogenetic $X$-tree with root $\rho$, where $|X| \geq 2$. If $x$ is a leaf of $\mathcal{T}$, we denote by $\mathcal{T} \backslash x$ the operation of deleting $x$ and its incident edge and, if the parent of $x$ in $\mathcal{T}$ has outdegree 2, suppressing the resulting degree-two vertex. Note that if the parent of $x$ is $\rho$ and $\rho$ has outdegree 2, then $\mathcal{T} \backslash x$ denotes the operation of deleting $x$ and its incident edge, and then deleting $\rho$ and its incident edge. Observe that $\mathcal{T} \backslash x$ is a phylogenetic tree on $X - \{x\}$. A 2-element subset $\{x, y\}$ of $X$ is a cherry of $\mathcal{T}$ if $x$ and $y$ have the same parent. Clearly, every phylogenetic tree with at least two leaves contains a cherry. In this paper, we typically distinguish the leaves in a cherry, in which case we write $\{x, y\}$ as the ordered pair $(x, y)$ or $(y, x)$, depending on the roles of $x$ and $y$.
Let $\mathcal{P}$ be a set of phylogenetic $X$-trees. A sequence
$$\sigma = (x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s), (x_{s+1}, -)$$
of ordered pairs in $X \times (X \cup \{-\})$ is a cherry-picking sequence of $\mathcal{P}$ if the following algorithm returns a set of phylogenetic trees each of which consists of a single vertex in $X$.
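Algorithm. Picking Cherries
Input. A set $\mathcal{P}$ of phylogenetic $X$-trees and a cherry-picking sequence $\sigma = (x_1, y_1), (x_2, y_2), \ldots, (x_s, y_s), (x_{s+1}, -)$ for $\mathcal{P}$.
Output. A set $\mathcal{P}_s$ of phylogenetic trees.
Step 1. Set $\mathcal{P}_0 = \mathcal{P}$ and, for each tree $\mathcal{T} \in \mathcal{P}$, set $\mathcal{T}_0 = \mathcal{T}$. Set $i = 1$.
Step 2. Set $\mathcal{P}_i$ to be the set of phylogenetic trees obtained from $\mathcal{P}_{i-1}$ by performing exactly one of the following two operations for each tree $\mathcal{T}_{i-1} \in \mathcal{P}_{i-1}$:
(a) If $\{x_i, y_i\}$ is a cherry of $\mathcal{T}_{i-1}$, then set $\mathcal{T}_i = \mathcal{T}_{i-1} \backslash x_i$.
(b) Else, set $\mathcal{T}_i = \mathcal{T}_{i-1}$.
Step 3. If $i < s$, increment $i$ by one and repeat Step 2; otherwise, return $\mathcal{P}_s$.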
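The algorithm Picking Cherries can be sketched in Python under a few stated assumptions. Trees are encoded as nested `frozenset` objects whose leaves are strings, which is a choice of this sketch and not of the paper. A trailing pair whose second coordinate is `None` is treated as the terminating pair of the sequence and is not processed. All function names are hypothetical.

```python
def pick(tree, x, y):
    """Delete leaf x from tree if {x, y} is a cherry; otherwise return tree unchanged."""
    if isinstance(tree, str):              # a single leaf
        return tree
    if x in tree and y in tree:            # x and y share this vertex as parent
        rest = tree - {x}
        # A parent left with one child is suppressed (or deleted, at the root).
        return next(iter(rest)) if len(rest) == 1 else rest
    return frozenset(pick(child, x, y) for child in tree)

def picking_cherries(trees, sequence):
    """Apply each ordered pair of the sequence, in order, to every tree."""
    for x, y in sequence:
        if y is None:                      # terminating pair: nothing to pick
            break
        trees = [pick(t, x, y) for t in trees]
    return trees

def is_cherry_picking_sequence(trees, sequence):
    """True if every tree is reduced to a single vertex."""
    return all(isinstance(t, str) for t in picking_cherries(trees, sequence))

# Hypothetical example on leaves a, b, c: the trees ((a,b),c) and (a,(b,c)).
t1 = frozenset({frozenset({"a", "b"}), "c"})
t2 = frozenset({"a", frozenset({"b", "c"})})
seq = [("b", "a"), ("b", "c"), ("a", "c"), ("c", None)]
print(picking_cherries([t1, t2], seq))
print(is_cherry_picking_sequence([t1, t2], seq))
```

In the example, the pair `("b", "a")` only changes the first tree and the pair `("b", "c")` only changes the second, which is exactly the case distinction made in Step 2.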
For all , we say that is obtained from by picking . Furthermore, if for each , we say that each ordered pair in is essential. The weight of , denoted , is the value . Observe that if is a cherrypicking sequence of , then
as each element in must appear as the first element in an ordered pair in .
A particular type of cherrypicking sequence underlies our characterisation of . To this end, let be a set of phylogenetic trees. A cherrypicking sequence
for is called a treechild sequence if and, for all , we have . Now, let be a treechild sequence for . We call a minimumtreechild sequence of if is of smallest value over all treechild sequences of . This smallest value is denoted by . It will follow from the results in the next section (Lemma 3) that every collection of phylogenetic trees has a treechild sequence and so is well defined.
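The ordering condition on a tree-child sequence, as it is used in the proofs of the next section, is that a leaf never appears as a second coordinate after it has appeared as a first coordinate. The Python sketch below checks only this condition; it does not check that the sequence is a cherry-picking sequence. The weight function assumes that the weight equals the number of ordered pairs minus the number of leaves, which is consistent with the observation that every leaf appears as a first coordinate but is an assumption of this sketch, as are the function names.

```python
def satisfies_tree_child_order(sequence):
    """True if no first coordinate reappears later as a second coordinate."""
    seen_first = set()
    for x, y in sequence:
        if y in seen_first:        # y was already used as a first coordinate
            return False
        seen_first.add(x)
    return True

def weight(sequence, leaves):
    """Assumed weight: number of ordered pairs minus number of leaves."""
    return len(sequence) - len(leaves)

# Hypothetical sequences on the leaf set {a, b, c}.
ok = [("b", "a"), ("b", "c"), ("a", "c"), ("c", None)]
not_ok = [("a", "b"), ("c", "a"), ("b", "c"), ("c", None)]
print(satisfies_tree_child_order(ok), weight(ok, {"a", "b", "c"}))
print(satisfies_tree_child_order(not_ok))
```

In the second sequence, the leaf `a` is picked in the first pair and then reused as the second coordinate of the next pair, which violates the condition.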
Referring to Figure 1,
is a treechild sequence with weight for the four trees shown at the top of this figure.
Remark. As noted in the introduction, cherrypicking sequences were introduced in [12]. In the setup of this paper, the difference is as follows. Instead of a cherrypicking sequence consisting of a set of ordered pairs, a cherrypicking sequence in [12] consists of an ordering of the elements in . Moreover, this ordering has the additional property that, in the step analogous to Step 2 of Picking Cherries, is part of a cherry of every tree in . At this step, is deleted from each tree in , and the iterative process continues. The weighting of such a sequence is based, across all , on the number of different cherries of which is part of. It is not difficult to see how this could be interpreted as a special type of treechild sequence.
The first of our new characterisations is the next theorem. For a given set of phylogenetic trees, it writes in terms of treechild sequences for .
Theorem 2.1
Let be a set of phylogenetic trees. Then
To state the second characterisation, we require an additional concept. Let $\mathcal{T}$ be a phylogenetic $X$-tree. Consider the operation of adjoining a new leaf $z$ to $\mathcal{T}$ in one of the following three ways.
(i) Subdivide an edge of $\mathcal{T}$ with a new vertex, $u$ say, and add the edge $(u, z)$.
(ii) View the root $\rho$ of $\mathcal{T}$ as a degree-one vertex adjacent to the original root and add the edge $(\rho, z)$.
(iii) Add the edge $(u, z)$, where $u$ is an interior vertex of $\mathcal{T}$.
We refer to this operation as attaching a new leaf $z$ to $\mathcal{T}$. More generally, if $Z$ is a finite set of elements such that $X \cap Z$ is empty, then attaching $Z$ to $\mathcal{T}$ is the operation of attaching, in turn, each element in $Z$ to $\mathcal{T}$ to eventually obtain a phylogenetic tree on $X \cup Z$. We refer to $Z$ as a set of auxiliary leaves. Lastly, attaching $Z$ to a set $\mathcal{P}$ of phylogenetic $X$-trees is the operation of attaching $Z$ to each tree in $\mathcal{P}$.
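The three ways of attaching a new leaf can be enumerated with a short recursive Python sketch. As before, trees are encoded as nested `frozenset` objects with string leaves; this encoding and the function name `attachments` are assumptions of the sketch.

```python
def attachments(tree, z):
    """Yield every tree obtained by attaching the new leaf z to tree."""
    # New parent above this subtree: at the top level this is the root case,
    # and inside the recursion it subdivides the edge into the subtree.
    yield frozenset({tree, z})
    if isinstance(tree, frozenset):
        # Add z as a further child of this interior vertex.
        yield tree | {z}
        for child in tree:
            for new_child in attachments(child, z):
                yield (tree - {child}) | {new_child}

# Hypothetical examples: a cherry and a three-leaf tree.
cherry = frozenset({"a", "b"})
three = frozenset({frozenset({"a", "b"}), "c"})
print(len(set(attachments(cherry, "z"))))
print(len(set(attachments(three, "z"))))
```

For the three-leaf tree, four of the results come from subdividing an edge, one from the root case, and two from adding the leaf to an existing interior vertex. Attaching a set of auxiliary leaves amounts to applying the generator repeatedly, once per leaf.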
Let be a set of phylogenetic trees. A treechild sequence for is leaf added if it is a treechild sequence of a set of phylogenetic trees obtained from by attaching a set of auxiliary leaves. We denote the minimum weight amongst all leafadded treechild sequences of by . Of course, , but this inequality can also be strict. To illustrate, consider the two sets and of phylogenetic trees shown in Figure 2. Now
is a treechild sequence for of weight . In fact, it follows from [13, 15] that (see Section 6 for details). On the other hand,
is a treechild sequence for of weight . Since can be obtained by attaching to , it follows that is a leafadded treechild sequence for and .
For a given set of phylogenetic trees, the next theorem characterises in terms of leafadded treechild sequences.
Theorem 2.2
Let be a set of phylogenetic trees. Then
3 Proofs of Theorems 2.1 and 2.2
In this section, we prove Theorems 2.1 and 2.2. Most of the work is in proving Theorem 2.1. We begin by showing that .
Lemma 1
Let be a set of phylogenetic trees. Let be a treechild sequence for . Then there exists a treechild network on that displays with satisfying the following properties:

(i) If is a tree vertex in and not the parent of a reticulation, then there are leaves and at the end of tree paths starting at the children and of , respectively, such that is an element in .
(ii) If is a tree vertex in and the parent of a reticulation , then there are leaves and at the end of tree paths starting at and , respectively, such that .
Proof
Let
be a treechild sequence for . The proof is by induction on . If , then and each tree in consists of the single vertex in . It immediately follows that choosing to be the phylogenetic network consisting of the single vertex in establishes the lemma for .
Now suppose that , and that the lemma holds for all treechild sequences for sets of phylogenetic trees on the same leaf set whose length is at most . Let
and let be the set of phylogenetic trees obtained from by picking .
First assume each tree in has the same leaf set, namely . Then is a treechild sequence for . By induction, there is a treechild network on that displays with and satisfies (i) and (ii). Since each tree in has the same leaf set, is a cherry in each tree in . Therefore, as displays a binary refinement of each tree in , the treechild network obtained from by subdividing the edge directed into with a new vertex and adding the edge displays . Furthermore, as and satisfies (i) and (ii) relative to , we have and it is easily seen that satisfies (i) and (ii) relative to .
Now assume that not every tree in has the same leaf set. Let denote the subset of trees in whose leaf set is . Since is nonempty, there exists an ordered pair in whose first coordinate is . Note that is not in ; otherwise there is an ordered pair in whose second coordinate is and so is not a treechild sequence for . Let be the first such ordered pair. Let be a tree in and, using , consider applying iterations of Picking Cherries to . Let denote the subset of leaves in that are deleted from in this process. Observe that, as is the second coordinate in , we have .
We next add to to obtain a phylogenetic tree for which is a treechild sequence. Let be the (unique) vertex of that is closest to the root with the property that is a descendant leaf of , and the child of on the path from to has all its descendant leaves in . Let be the binary phylogenetic tree obtained from by adding the edge . We now show that is a treechild sequence for . Suppose that is not a treechild sequence for . Let be the parent of in . Then amongst the first ordered pairs in is an ordered pair of the form that is essential when, using , Picking Cherries is applied to , where is a descendant leaf of in . But then, each descendant leaf of is in , contradicting the choice of .
Repeating this placement of for each tree in , we obtain a set of phylogenetic trees from . Let and observe that is a treechild sequence for . Therefore, by induction, there is a treechild network on that displays with and satisfies (i) and (ii).
Let denote the parent of in . If is a reticulation, let be the phylogenetic network obtained from by subdividing the edge directed into with a new vertex and adding the edge . Since is tree child and displays , it follows that is tree child and displays . Furthermore,
Additionally, as , it also follows that, as satisfies (i) and (ii) relative to , we have satisfies (i) and (ii) relative to .
Thus we may assume that is a tree vertex. Let denote the child of that is not in . If is a reticulation, then, as satisfies (ii), contains a cherry in which is the second coordinate. But is the first ordered pair in and so, as is tree child, is never the second coordinate in an ordered pair in ; a contradiction. Therefore is either a tree vertex or a leaf in . So, as satisfies (i) and no ordered pair has as the second coordinate, it follows that contains an ordered pair, say, where is the leaf at the end of a tree path in starting at . Now let be the phylogenetic network obtained from by subdividing the edges directed into and with new vertices and , respectively, and adding the edge . Since is tree child and , it is easily seen that is tree child and . Furthermore, displays as well as , and therefore . Thus displays . To see that satisfies (i) and (ii) relative to , it suffices to show that satisfies (ii) for and . Indeed, the two ordered pairs and in verify (ii) for and , respectively. This completes the proof of the lemma.
The next corollary immediately follows from Lemma 1.
Corollary 2
Let be a set of phylogenetic trees. Then .
For the proof of the converse of Corollary 2, we begin with an additional lemma. Let be a phylogenetic network, and let and be two leaves in . Generalising cherries to phylogenetic networks, we say that is a cherry in if and have a common parent. Moreover, we call a reticulated cherry if the parent of , say , and the parent of , say , are joined by a reticulation edge in which case we say that is the reticulation leaf relative to . We next define two operations on . First, reducing a cherry is the operation of deleting one of the two leaves in , and suppressing the resulting degree2 vertex. Second, reducing a reticulated cherry is the operation of deleting the reticulation edge joining the parents of and and suppressing any resulting degree2 vertices. The proof of the next lemma is similar to the analogous result for binary treechild networks [5, Lemma 4.1] and is omitted.
Lemma 2
Let be a treechild network on . Then the following hold.

(i) If $|X| \geq 2$, then $\mathcal{N}$ contains either a cherry or a reticulated cherry.
(ii) If $\mathcal{N}'$ is obtained from $\mathcal{N}$ by reducing either a cherry or a reticulated cherry, then $\mathcal{N}'$ is a tree-child network.
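To make the notions of cherry and reticulated cherry concrete, the following Python sketch lists both kinds of pairs in a network given as an edge list. The encoding, the function name `reducible_pairs`, and the example network are assumptions of this sketch. A reticulated cherry is reported as an ordered pair whose first entry is the reticulation leaf.

```python
from collections import defaultdict

def reducible_pairs(edges):
    """Return (cherries, reticulated cherries) of a network given as edges."""
    parents, children = defaultdict(list), defaultdict(list)
    for u, v in edges:
        parents[v].append(u)
        children[u].append(v)
    leaves = {v for v in parents if v not in children}
    cherries, reticulated = set(), set()
    for a in leaves:
        p = parents[a][0]                  # a leaf has exactly one parent
        if len(parents[p]) >= 2:
            # p is a reticulation, so a is a candidate reticulation leaf:
            # look for a leaf b below one of the parents of p.
            for q in parents[p]:
                for b in children[q]:
                    if b in leaves:
                        reticulated.add((a, b))
        else:
            for b in children[p]:
                if b in leaves and b != a:
                    cherries.add(frozenset({a, b}))
    return cherries, reticulated

# Hypothetical tree-child network with one reticulation r above leaf b.
net = [("rho", "u"), ("rho", "v"), ("u", "a"), ("u", "r"),
       ("v", "r"), ("v", "c"), ("r", "b")]
print(reducible_pairs(net))
```

The example network has three leaves and no cherry, so the first part of Lemma 2 guarantees a reticulated cherry; the sketch finds two, both with `b` as the reticulation leaf.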
Lemma 3
Let be a set of phylogenetic trees. Then .
Proof
Let be a treechild network on that displays . By Corollary 1, such a network exists. We establish the lemma by explicitly constructing a treechild sequence for such that .
Let denote the root of , and let denote the reticulations of . Let denote the leaves at the end of tree paths in starting at , respectively. Observe that these paths are pairwise vertex disjoint. We now construct a sequence of ordered pairs as follows:

Step 1. Set and to be the empty sequence. Set .
Step 2. If consists of a single vertex , then set to be the concatenation of and , and return .
Step 3. If is a cherry in , then
(a) If one of and , say , equates to for some and is not a reticulation in , then set to be the concatenation of and .
(b) Otherwise, set to be the concatenation of and , where .
(c) Set to be the treechild network obtained from by deleting , thereby reducing the cherry .
(d) Increase by one and go to Step 2.
Step 4. Else, there is a reticulated cherry in , where say is the reticulation leaf.
(a) Set to be the concatenation of and .
(b) Set to be the treechild network obtained from by reducing the reticulated cherry .
(c) Increase by one and go to Step 2.

First note that it is easily checked that the construction is well defined, that is, it returns a sequence of ordered pairs. Moreover, in each iteration of the above construction, it follows from Lemma 2 that is tree child. We next show that, if is a cherry in , and and equate to and , respectively, where and are elements in , then exactly one of and is a reticulation in . To see this, if and are both reticulations in , then and are not vertex disjoint in ; a contradiction. On the other hand, suppose neither and are reticulations in . Without loss of generality, we may assume is the first such cherry for which this holds. Since is tree child, and therefore has no tree vertex that is the parent of two reticulations, there is an iteration , in which the cherry is concatenated with , where , and has as a cherry but does not. If or , we contradict the construction by the choice of . Also, if , then we again contradict the construction. Hence, we may assume for the remainder of the proof that exactly one of and is a reticulation in .
Let
be the sequence returned by the construction. We prove by induction on that is a treechild sequence for whose weight is at most . If , then consists of the single vertex in and the construction correctly returns such a sequence.
Now suppose that , and consider the first iteration of the construction. Either is a cherry or a reticulated cherry of . If is a cherry, then is a cherry of each tree in . In this instance, let denote the set of phylogenetic trees obtained from by picking , where . Observe that is a treechild network on that displays .
Now assume that is a reticulated cherry with as the reticulation leaf. Let be the subset of trees in not displayed by and let . Note that is a cherry of each tree in . For each tree in , delete the edge incident with , suppress any resulting degree2 vertex, and reattach to the rest of the tree containing by subdividing an edge with a new vertex and adding an edge joining this vertex and so that the resulting phylogenetic tree is displayed by . It is easily seen that this is always possible. Let denote the resulting collection of trees obtained from . For this instance, let and observe that displays . To complete the induction it suffices to show that if
is a treechild sequence for whose weight is at most , then is a treechild sequence for whose weight is at most , that is, at most .
First assume that is a cherry of . Then, as is a cherry of each tree in , it follows by induction that is a cherrypicking sequence of . Furthermore, as is tree child and , is also tree child. Since only appears once as the first coordinate of an ordered pair in , we have
Now assume that is a reticulated cherry of with as the reticulation leaf. Without loss of generality, let denote the associated reticulation, so that . Since each tree in has as a cherry and is a treechild sequence for , it follows that is a cherrypicking sequence for .
We next show that is tree child. If the indegree of is at least three in , then exists in and so, by construction, does not appear as the second coordinate of an ordered pair in as well as in . Therefore, if the indegree of is at least three in , then is a tree child.
Now suppose that the indegree of is two in . To establish that is tree child, assume to the contrary that appears as the second coordinate of an ordered pair in . Let denote the first such ordered pair. Then, at some iteration , either is a cherry or a reticulated cherry of . If is a cherry of , then, since , we are in Step 3(a) in iteration of the construction and so the ordered pair should be ; a contradiction. On the other hand, if is a reticulated cherry of , then is the reticulation leaf of and, by construction of , one of the parents of in is the parent of two reticulations in , namely and the reticulation for which, by construction, there is a tree path starting at this reticulation and ending at ; a contradiction as is tree child. Hence is tree child. Furthermore, as and ,