Recovering tree-child networks from shortest inter-taxa distance information

11/24/2017 ∙ by Magnus Bordewich, et al. ∙ University of Canterbury ∙ University of East Anglia ∙ Durham University

Phylogenetic networks are a type of leaf-labelled, acyclic, directed graph used by biologists to represent the evolutionary history of species whose past includes reticulation events. A phylogenetic network is tree-child if each non-leaf vertex is the parent of a tree vertex or a leaf. Up to a certain equivalence, it has been recently shown that, under two different types of weightings, edge-weighted tree-child networks are determined by their collection of distances between each pair of taxa. However, the size of these collections can be exponential in the size of the taxa set. In this paper, we show that, if we ignore redundant edges, the same results are obtained with only a quadratic number of inter-taxa distances by using the shortest distance between each pair of taxa. The proofs are constructive and give cubic-time algorithms in the size of the taxa sets for building such weighted networks.

1. Introduction

Distance-based methods collectively provide fundamental tools for the reconstruction and analysis of phylogenetic (evolutionary) trees. Two of the most popular and longstanding distance-based methods are UPGMA [14] and Neighbor Joining [11]. Loosely speaking, these methods take as input a distance matrix $D$ on a set $X$ of taxa, whose entries are the distances between pairs of taxa in $X$, and return an edge-weighted phylogenetic tree on $X$ that best represents $D$. Distances between taxa could, for example, measure the time since the two taxa separated from a common ancestor. Typically, the property underlying any distance-based method is the following: if $T$ is a phylogenetic tree whose edges are assigned a positive real-valued weight, then the pairwise distances between taxa are sufficient to determine $T$ and its edge weighting [6, 13, 20]. This property is frequently referred to as the Tree-Metric Theorem (see [12, Theorem 7.2.6]).

While there exist numerous distance-based methods for reconstructing and analysing phylogenetic trees that aim to explicitly represent the treelike evolution of species, there are only a small number of such methods for phylogenetic networks. Yet, phylogenetic networks are necessary to accurately represent the ancestral history of sets of taxa whose evolution includes non-treelike (reticulate) evolutionary processes such as hybridisation and lateral gene transfer. As a step towards developing practical distance-based methods for reconstructing and analysing phylogenetic networks, in this paper we are interested in establishing analogues of the Tree-Metric Theorem for edge-weighted phylogenetic networks. In particular, we establish two such analogues for the increasingly prominent class of tree-child networks. Briefly, we show that, under two types of weightings, edge-weighted tree-child networks on $X$ with no redundant edges can be reconstructed from the pairwise shortest distances between taxa in time polynomial in the size of $X$. The first type of weighting induces an ultrametric, while the second type of weighting has the property that the pair of edges directed into a reticulation (i.e. a vertex of in-degree two) have equal weight. We envisage that these results could lead to practical algorithms to construct phylogenetic networks from distance data.

In related work, Chan et al. [8] take a matrix of inter-taxa distances and reconstruct an ultrametric galled network having the property that there is a path between each pair of taxa having the same length as that given in the matrix, if such a network exists. Willson [18] studied the problem of determining a phylogenetic network given the average distance between each pair of taxa, where each reticulation assigns a probability to the two edges directed into it. It is shown in [18] that such distances are enough to determine a phylogenetic network with a single reticulation cycle in polynomial time. Bordewich and Semple [3], the original starting point for this paper, showed that (unweighted) tree-child networks can be reconstructed from the multi-set of distances between taxa in polynomial time. Other methods have been developed for building phylogenetic networks from distance data (see, for example, [16, 19]), but these use different approaches to associate a distance to a network than the ones presented here. In addition, Huber and Scholz [9] considered the problem of reconstructing phylogenetic networks from so-called symbolic distances, and it would be interesting to see if our new results could extend to such distances. Two other papers (specifically [4, 5]) that are particularly relevant to the work of our paper are discussed in more detail in the next section.

Although distance-based methods may be considered less accurate than other methods such as Maximum Likelihood, they still have an important role in studying large datasets and gaining quick results when exploring data. This role may be even more significant for phylogenetic networks, where the complexity of inferring optimal solutions is even higher than for phylogenetic trees. The next section details some background and gives the statements of the two main results.

2. Main Results

Throughout the paper, $X$ denotes a non-empty finite set. A phylogenetic network $N$ on $X$ is a rooted acyclic directed graph with no parallel edges satisfying the following:

  1. the unique root has out-degree two;

  2. the set $X$ is the set of vertices of out-degree zero, each of which has in-degree one; and

  3. all other vertices either have in-degree one and out-degree two, or in-degree two and out-degree one.

If $|X| = 1$, then we additionally allow the directed graph consisting of the single vertex in $X$ to be a phylogenetic network. The vertices in $X$ are called leaves, while the vertices of in-degree one and out-degree two are tree vertices, and the vertices of in-degree two and out-degree one are reticulations. An edge directed into a reticulation is a reticulation edge; all other edges are tree edges. An element in $X$ is an outgroup if its parent is the root of $N$. Furthermore, an edge $(u, v)$ is a shortcut if there is a directed path in $N$ from $u$ to $v$ avoiding $(u, v)$. Note that, necessarily, a shortcut is a reticulation edge. In the literature, shortcuts are also known as redundant edges. Ignoring the weighting of the edges, Fig. 1(i) shows a phylogenetic network with an outgroup. It has exactly two reticulations, but no shortcuts.

Figure 1. Two (weighted) phylogenetic networks, shown as panels (i) and (ii).

Let $N$ be a phylogenetic network. We say $N$ is tree-child if each non-leaf vertex in $N$ is the parent of either a tree vertex or a leaf. Moreover, $N$ is normal if it is tree-child and has no shortcuts. For example, the phylogenetic network in Fig. 1(i) is normal. As with all figures in this paper, edges are directed down the page. Tree-child networks and normal networks were introduced by Cardona et al. [7] and Willson [15, 17], respectively.
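The definitions above can be checked mechanically on small examples. The following is a minimal sketch, assuming a network is given as a list of directed edges; the function names and the two example networks `NORMAL` and `WITH_SHORTCUT` are hypothetical and are not taken from the paper's figures.

```python
from collections import defaultdict

# Hypothetical example: h is a reticulation with parents u and v.
NORMAL = [("rho", "u"), ("rho", "v"), ("u", "a"), ("u", "h"),
          ("v", "h"), ("v", "c"), ("h", "b")]
# Hypothetical example: (t, h) is a shortcut because of the path t -> s -> h.
WITH_SHORTCUT = [("rho", "t"), ("rho", "c"), ("t", "s"), ("t", "h"),
                 ("s", "a"), ("s", "h"), ("h", "b")]

KINDS = {(0, 2): "root", (1, 0): "leaf", (1, 2): "tree", (2, 1): "reticulation"}


def child_map(edges):
    """Map each vertex to the list of its children."""
    children = defaultdict(list)
    for u, v in edges:
        children[u].append(v)
    return children


def vertex_types(edges):
    """Classify every vertex by its (in-degree, out-degree) pair."""
    indeg, outdeg = defaultdict(int), defaultdict(int)
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return {x: KINDS.get((indeg[x], outdeg[x]), "invalid")
            for x in set(indeg) | set(outdeg)}


def shortcuts(edges):
    """Return the edges (u, v) for which a second directed u-v path exists."""
    children = child_map(edges)

    def reaches(src, dst, banned):
        stack, seen = [src], {src}
        while stack:
            x = stack.pop()
            for y in children.get(x, ()):
                if (x, y) == banned or y in seen:
                    continue
                if y == dst:
                    return True
                seen.add(y)
                stack.append(y)
        return False

    return [(u, v) for u, v in edges if reaches(u, v, (u, v))]


def is_tree_child(edges):
    """Every non-leaf vertex must have a child that is a tree vertex or a leaf."""
    kinds, children = vertex_types(edges), child_map(edges)
    return all(any(kinds[c] in ("tree", "leaf") for c in children[x])
               for x in kinds if kinds[x] != "leaf")


def is_normal(edges):
    """Normal means tree-child with no shortcuts."""
    return is_tree_child(edges) and not shortcuts(edges)
```

On these examples, `NORMAL` is normal, whereas `WITH_SHORTCUT` is tree-child but not normal. The sketch does not validate acyclicity or the absence of parallel edges; both are assumed of the input.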

Let $N$ be a phylogenetic network on $X$, and let $E$ denote the edge set of $N$. A weighting of $N$ is a function $w: E \to \mathbb{R}^{\ge 0}$ that assigns edges a non-negative real-valued weight such that tree edges are assigned a strictly positive weight. Thus reticulation edges are allowed weight zero. Let $(N, w)$ be a weighted phylogenetic network, and let $u$ and $v$ be vertices in $N$. An up-down path $P$ from $u$ to $v$ is an underlying path

$$u = v_0, v_1, \ldots, v_{k-1}, v_k = v$$

where, for some $i \in \{0, 1, \ldots, k\}$,

$$(v_i, v_{i-1}), (v_{i-1}, v_{i-2}), \ldots, (v_1, v_0)$$

and

$$(v_i, v_{i+1}), (v_{i+1}, v_{i+2}), \ldots, (v_{k-1}, v_k)$$

are edges in $N$. The length of $P$ is the sum of the weights of the edges in $P$ and, provided $u$ and $v$ are distinct, $v_i$ is the peak of $P$. In Fig. 1(i), for instance, there is a pair of leaves joined by exactly two up-down paths. As in this example, in this paper we are interested in the inter-taxa distances in $N$, that is, the lengths of the up-down paths connecting leaves in $N$. We next state two recent results [5, 4], both of which prompted the two main results in this paper.

Let be a weighted phylogenetic network on and, for all , let be the set of up-down paths from to in . The multi-set of distances from to , denoted , is the multi-set of the lengths of the paths in . Similarly, the set of distances from to , denoted , is the set of the lengths of the paths in . Note that , , , and for all . The multi-set distance matrix of is the matrix whose -th entry is (respectively, the set distance matrix of is the matrix whose -th entry is ) for all , in which case we say (resp. ) is realised by .

Let $(N, w)$ be a weighted phylogenetic network on $X$ and let $D$ be the multi-set distance matrix of $(N, w)$. The underlying problem we are investigating is determining how much $D$ says about $(N, w)$. The following highlights that, even with tree edges having positive weights, the weighting of $N$ cannot be determined exactly. Let $v$ be a reticulation in $N$ with parents $p$ and $q$, and let $c$ be the unique child of $v$. We can change the weighting of the edges incident with $v$ without changing $D$. In particular, provided the sum of the weights of $(p, v)$ and $(v, c)$ equals

$$w(p, v) + w(v, c)$$

and the sum of the weights of $(q, v)$ and $(v, c)$ equals

$$w(q, v) + w(v, c),$$

we can change the weights of $(p, v)$, $(q, v)$, and $(v, c)$ (keeping the weights of all other edges the same) to construct a different weighting $w'$, say, so that $D$ is realised by $(N, w')$. We refer to this change as re-weighting the edges at a reticulation of $N$. For example, the two weighted networks in Fig. 1 differ by re-weighting the edges at both reticulations. If $N$ has an outgroup $r$, a similar occurrence happens with the two edges incident with the root $\rho$ of $N$. In particular, now let $u$ be the child of $\rho$ that is not $r$. Note that $u$ is a tree vertex. Then, provided the weights of $(\rho, r)$ and $(\rho, u)$ sum to

$$w(\rho, r) + w(\rho, u),$$

we can change the weights of $(\rho, r)$ and $(\rho, u)$ (keeping the weights of all other edges the same) to construct a different weighting $w'$, say, so that $D$ is realised by $(N, w')$. We refer to this change as re-weighting the edges at the root of $N$.
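To see why re-weighting at a reticulation leaves the distances unchanged, note that any up-down path between leaves that passes through a reticulation uses its outgoing edge together with exactly one of its incoming edges. The sketch below illustrates this on a small hypothetical network (not one of the paper's figures): it enumerates all up-down paths between two leaves by brute force and compares the multi-sets of lengths before and after moving weight from both incoming edges onto the outgoing edge. The names `updown_lengths`, `reweight_at_reticulation`, `W`, and `W2` are assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical weighted network: h is a reticulation with parents u and v.
W = {("rho", "u"): 2, ("rho", "v"): 4, ("u", "a"): 4, ("u", "h"): 3,
     ("v", "h"): 1, ("v", "c"): 2, ("h", "b"): 1}


def directed_paths(children, src, dst):
    """Yield every directed path from src to dst as a list of vertices."""
    if src == dst:
        yield [src]
        return
    for child in children.get(src, ()):
        for rest in directed_paths(children, child, dst):
            yield [src] + rest


def updown_lengths(weights, x, y):
    """Multi-set of lengths of the up-down paths joining distinct leaves x and y."""
    children = defaultdict(list)
    for u, v in weights:
        children[u].append(v)
    vertices = {z for edge in weights for z in edge}

    def length(path):
        return sum(weights[e] for e in zip(path, path[1:]))

    lengths = Counter()
    for peak in vertices:
        for left in directed_paths(children, peak, x):
            for right in directed_paths(children, peak, y):
                # The two halves may share only the peak.
                if set(left) & set(right) == {peak}:
                    lengths[length(left) + length(right)] += 1
    return lengths


def reweight_at_reticulation(weights, v, delta):
    """Move delta from both edges entering v onto the edge leaving v."""
    new = dict(weights)
    for a, b in weights:
        if b == v:
            new[(a, b)] -= delta
        elif a == v:
            new[(a, b)] += delta
    return new


W2 = reweight_at_reticulation(W, "h", 1)
```

Here `W2` assigns weight 0 to the reticulation edge `("v", "h")`, which the definition of a weighting permits, and every pair of leaves has the same multi-set of up-down path lengths under `W` and `W2`. The enumeration is exponential in general and is meant only for small examples.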

Let $(N, w)$ and $(N', w')$ be two weighted phylogenetic networks on $X$. We say $(N, w)$ and $(N', w')$ are equivalent if $N$ is isomorphic to $N'$, and $w'$ can be obtained from $w$ by re-weighting the edges at each reticulation and re-weighting the edges at the root. For example, the weighted phylogenetic networks in Fig. 1 are equivalent.

Figure 2. A phylogenetic network with an equidistant weighting.

We now state the two recent results. A weighting of a phylogenetic network is equidistant if all paths starting at the root and ending at a leaf have the same length. A weighting $w$ of a phylogenetic network $N$ is a reticulation-pair weighting if, for each reticulation $v$ in $N$, the two edges directed into $v$ have the same weight. The weightings of the phylogenetic networks shown in Fig. 1 are reticulation-pair weightings. The weighting of the phylogenetic network shown in Fig. 2 is an equidistant weighting. The following theorems are established in [5] and [4], respectively.
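Both kinds of weighting are straightforward to test on a concrete network. The following is a minimal sketch, assuming a weighted network is stored as a dictionary from directed edges to weights; the function names and the example weightings `EQUI` and `PAIR` are hypothetical and do not correspond to Fig. 1 or Fig. 2.

```python
from collections import defaultdict

# Hypothetical equidistant weighting: every root-to-leaf path has length 6.
EQUI = {("rho", "u"): 2, ("rho", "v"): 4, ("u", "a"): 4, ("u", "h"): 3,
        ("v", "h"): 1, ("v", "c"): 2, ("h", "b"): 1}
# Same network, but both edges into the reticulation h now have weight 3.
PAIR = dict(EQUI)
PAIR[("v", "h")] = 3


def root_to_leaf_lengths(weights):
    """Set of lengths of all directed paths from the root to a leaf."""
    children, heads = defaultdict(list), set()
    for (u, v), w in weights.items():
        children[u].append((v, w))
        heads.add(v)
    (root,) = {u for u, _ in weights} - heads  # the unique vertex of in-degree 0
    lengths, stack = set(), [(root, 0)]
    while stack:
        x, dist = stack.pop()
        if x not in children:  # x is a leaf
            lengths.add(dist)
        for y, w in children.get(x, ()):
            stack.append((y, dist + w))
    return lengths


def is_equidistant(weights):
    """True if all root-to-leaf paths have the same length."""
    return len(root_to_leaf_lengths(weights)) == 1


def is_reticulation_pair(weights):
    """True if the two edges into each reticulation have equal weight."""
    incoming = defaultdict(list)
    for (_, v), w in weights.items():
        incoming[v].append(w)
    return all(len(set(ws)) == 1 for ws in incoming.values() if len(ws) == 2)
```

The comparisons are exact, which suits integer or rational weights; with floating-point weights a tolerance would be needed instead.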

Theorem 2.1.

Let be the set distance matrix of an equidistant-weighted tree-child network on . Then, up to equivalence, is the unique such network realising , in which case a member of its equivalence class can be found from in time.

Theorem 2.2.

Let be the multi-set distance matrix with distinguished element of a reticulation-pair weighted tree-child network on with outgroup . Then, up to equivalence, is the unique such network realising , in which case a member of its equivalence class can be found from in time .

Theorems 2.1 and 2.2 are somewhat surprising given that, in the size of the leaf set, it is possible to have exponentially-many up-down paths connecting leaves in a tree-child network. For example, Fig. 3 shows a normal network with leaves in which there are distinct up-down paths connecting and . Nevertheless, the fact that all such paths are considered is not satisfactory. The next two theorems are the two main results of this paper. They show that, up to shortcuts, we can retain the outcomes of Theorems 2.1 and 2.2 by knowing only a quadratic number of inter-taxa distances.

Figure 3. A normal network on . There are distinct up-down paths connecting and .

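Let $(N, w)$ be a weighted phylogenetic network on $X$. The minimum distance matrix $D_{\min}$ of $(N, w)$ is the $|X| \times |X|$ matrix whose $(x, y)$-th entry is the minimum length of an up-down path joining $x$ and $y$ for all $x, y \in X$. We denote the $(x, y)$-th entry in $D_{\min}$ by $d_{\min}(x, y)$.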
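The minimum distance matrix can be computed without enumerating up-down paths. The sketch below rests on one observation, stated here as our own reasoning rather than the paper's: for a common ancestor of two leaves, the sum of the shortest directed distances to each leaf is attained by an up-down path whose peak is that ancestor or a lower one, because two directed paths that share vertices can be cut at their last shared vertex without increasing the total weight. Minimising over all vertices therefore gives the minimum up-down path length. The function name and the example weighting `W` are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical weighted network: h is a reticulation with parents u and v.
W = {("rho", "u"): 2, ("rho", "v"): 4, ("u", "a"): 4, ("u", "h"): 3,
     ("v", "h"): 1, ("v", "c"): 2, ("h", "b"): 1}


def min_distance_matrix(weights):
    """Return {x: {y: minimum up-down path length}} over all leaves x, y."""
    children = defaultdict(list)
    for (u, v), w in weights.items():
        children[u].append((v, w))
    vertices = {x for edge in weights for x in edge}
    leaves = sorted(x for x in vertices if x not in children)

    down = {}  # down[v][x] = shortest directed distance from v to leaf x

    def descend(v):
        if v not in down:
            best = {} if v in children else {v: 0}
            for child, w in children.get(v, ()):
                for leaf, d in descend(child).items():
                    best[leaf] = min(best.get(leaf, math.inf), w + d)
            down[v] = best
        return down[v]

    matrix = {x: {y: math.inf for y in leaves} for x in leaves}
    for v in vertices:
        reach = descend(v)
        for x in reach:
            for y in reach:
                matrix[x][y] = min(matrix[x][y], reach[x] + reach[y])
    return matrix


D = min_distance_matrix(W)
```

In this example the two up-down paths joining `a` and `b` have lengths 8 and 12, so `D["a"]["b"]` is 8. The recursion is memoised, so each vertex is processed once; it assumes the input is acyclic.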

Theorem 2.3.

Let $D_{\min}$ be the minimum distance matrix of an equidistant-weighted normal network $(N, w)$ on $X$. Then, up to equivalence, $(N, w)$ is the unique such network realising $D_{\min}$, in which case a member of its equivalence class can be found from $D_{\min}$ in $O(|X|^3)$ time.

Now let be a weighted phylogenetic network on , where is an outgroup. For the purposes of the next theorem, the minimum distance matrix of is the matrix whose -th entry is for all . Furthermore, the maximum distance outgroup vector, denoted

, is the vector of length

whose -th coordinate is the maximum length of an up-down path joining and for all . We denote the -th coordinate in by .

Theorem 2.4.

Let and be the minimum distance matrix and maximum distance outgroup vector of a reticulation-pair weighted normal network on , where is an outgroup. Then, up to equivalence, is the unique such network realising and , in which case a member of its equivalence class can be found from and in time.

It is easily seen that it is not possible to extend Theorems 2.3 and 2.4 to tree-child networks as the distance information given in the hypothesis of these theorems is insufficient to determine shortcuts. However, these results do hold for temporal tree-child networks as such networks have no shortcuts and are therefore normal. For the definition of temporal phylogenetic network, see [1].

The rest of the paper is organised as follows. The next section contains some necessary preliminaries. In particular, it contains the constructions of the three operations that underlie the inductive proofs of Theorems 2.3 and 2.4. These proofs, including explicit descriptions of the associated algorithms, are given in Sections 4 and 5, respectively.

3. Preliminaries

Let $N$ be a phylogenetic network on $X$ and let $\{a, b\}$ be a $2$-element subset of $X$. Denote the unique parents of $a$ and $b$ by $p_a$ and $p_b$, respectively. We call $\{a, b\}$ a cherry if $p_a = p_b$, that is, the parents of $a$ and $b$ are the same. Furthermore, we call $\{a, b\}$ a reticulated cherry if either $p_a$ or $p_b$, say $p_b$, is a reticulation and $p_a$ is a parent of $p_b$, in which case $b$ is the reticulation leaf of the reticulated cherry. Observe that $p_a$ is necessarily a tree vertex. To illustrate, the network in Fig. 1(i) contains both a cherry and reticulated cherries. The next lemma is well known for tree-child networks (for example, see [3]). The restriction to normal networks is immediate. We will use it freely throughout the paper.

Lemma 3.1.

Let $N$ be a normal network on $X$. Then

  1. If $|X| = 1$, then $N$ consists of the single vertex in $X$, while if $|X| = 2$, say $X = \{a, b\}$, then $N$ consists of the cherry $\{a, b\}$.

  2. If $|X| \ge 3$, then $N$ has either a cherry or a reticulated cherry.

In addition to using the last lemma freely, we will also use freely the following observation. If $N$ is a normal network and $v$ is a vertex of $N$, then there is a directed path from $v$ to a leaf avoiding reticulations except perhaps $v$.
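Cherries and reticulated cherries can be located directly from the definitions. The following is a minimal sketch, assuming a network with at least two leaves given as a list of directed edges; the function name and both example networks are hypothetical. Reticulated cherries are returned as ordered pairs whose second entry is the reticulation leaf.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical network in which h is a reticulation with parents u and v.
NETWORK = [("rho", "u"), ("rho", "v"), ("u", "a"), ("u", "h"),
           ("v", "h"), ("v", "c"), ("h", "b")]
# Hypothetical tree in which a and b share the parent t.
TREE = [("rho", "t"), ("rho", "c"), ("t", "a"), ("t", "b")]


def find_cherries(edges):
    """Return (cherries, reticulated_cherries) of a network with >= 2 leaves."""
    parents, has_child = defaultdict(list), set()
    for u, v in edges:
        parents[v].append(u)
        has_child.add(u)
    leaves = sorted(v for v in parents if v not in has_child)

    cherries, reticulated = [], []
    for a, b in permutations(leaves, 2):
        p_a, p_b = parents[a][0], parents[b][0]  # each leaf has one parent
        if p_a == p_b:
            if a < b:  # report each unordered cherry once
                cherries.append((a, b))
        elif len(parents[p_b]) == 2 and p_a in parents[p_b]:
            reticulated.append((a, b))  # b is the reticulation leaf
    return cherries, reticulated
```

In `NETWORK`, the leaf `b` is the reticulation leaf of two reticulated cherries, one with `a` and one with `c`, and there is no cherry. In `TREE`, the pair `a`, `b` is the only cherry.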

Let be a weighted phylogenetic network on . We now describe three operations on . The first and second operations underlie the inductive approach we take to prove Theorem 2.3, while the first and third operations underlie the inductive approach we take to prove Theorem 2.4. Let be -element subset of , and denote the parents of and by and , respectively. First suppose that is a cherry. Let denote the parent of . Then reducing is the operation of deleting and its incident edge, suppressing , and setting the weight of the resulting edge to be

Now suppose that is a reticulated cherry, in which is the reticulation leaf. Let denote the parent of , and let denote the parent of that is not . Then cutting is the operation of deleting , suppressing and , and setting the weight of the resulting edge to be

and the edge to be

Lastly, if is a tree vertex, then isolating is the operation of deleting , suppressing and , and setting the weight of the resulting edge to be

and the edge to be

where is the parent of and is the child of that is not . To illustrate the last two operations, consider Fig. 4. The weighted phylogenetic network has been obtained from the weighted phylogenetic network in Fig. 2 by cutting , while has been obtained from the same network by isolating .

Figure 4. The weighted phylogenetic networks and obtained from the weighted phylogenetic network in Fig. 2 by cutting and isolating , respectively.

The proof of the next lemma is straightforward and omitted.

Lemma 3.2.

Let be a weighted normal network. Let be a cherry or a reticulated cherry of . If is obtained from by reducing if is a cherry, or by cutting or isolating if is a reticulated cherry, then is a normal network. Furthermore, if is an equidistant (resp. reticulation-pair) weighting, then is an equidistant (resp. reticulation-pair) weighting.

We next describe three operations on distance matrices that parallel the operations of reducing, cutting, and isolating. Let be a distance matrix on with each entry consisting of a single value, that is, is an matrix whose -th entry is denoted . Let be a -element subset of . Although the choice of appears arbitrary, in the sections that follow, will always be a cherry in the case of the first operation or a reticulated cherry in which is the reticulation leaf in the case of the second and third operations. Furthermore, in the sections that follow, the operations will always be well defined. First, let be the distance matrix on obtained from by setting

for all . We say that has been obtained from by reducing in .

For the second operation, let

and let . Furthermore, let

Intuitively, are those leaves whose shortest path to does not go through the parent of , and are those members of at minimum distance from . Now let be the distance matrix on obtained from by setting

for all , setting , and setting

if , and

otherwise for all . We say that has been obtained from by cutting in . Intuitively, elements of are being used as a proxy for in determining the minimum distances from to members of in ; the distance from to any member of remains .

For the last operation, let be a distinguished element in , and let

Intuitively, is the difference in length of the edge from the parent of to and the path from the parent of to . Let be the distance matrix on obtained from by setting

for all ,

for all , and

We say that has been obtained from by isolating in .

4. Proof of Theorem 2.3

In this section, we establish Theorem 2.3. We begin with two lemmas.

Lemma 4.1.

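Let $(N, w)$ be an equidistant-weighted normal network on $X$, where $|X| \ge 2$. Let $D_{\min}$ be the minimum distance matrix of $(N, w)$, with entries $d_{\min}(x, y)$, and let $\{a, b\}$ be a $2$-element subset of $X$ such that

$$d_{\min}(a, b) = \min\{d_{\min}(x, y) : x, y \in X,\ x \neq y\}.$$

Then $\{a, b\}$ is either a cherry or a reticulated cherry of $N$. Moreover,

  1. The set $\{a, b\}$ is a cherry of $N$ if $d_{\min}(a, x) = d_{\min}(b, x)$ for all $x \in X - \{a, b\}$.

  2. Otherwise, $\{a, b\}$ is a reticulated cherry of $N$ in which $b$ is the reticulation leaf precisely if $d_{\min}(a, x) > d_{\min}(b, x)$ for some $x \in X - \{a, b\}$.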

Proof.

Let and denote the parents of and , respectively. If and are both reticulations, then, as is normal and equidistant, it is easily seen that there is an element such that either or ; a contradiction ( is a descendant of a tree vertex on the shortest up-down path from to that is not the peak of that up-down path). Thus, either or is a tree vertex. Without loss of generality, we may assume that is a tree vertex. Let denote the child of that is not . If is a tree vertex, then, as is equidistant,

where is a cherry or reticulated cherry and are descendants of ; a contradiction. Therefore either is a leaf or a reticulation. If is a leaf, then, as is equidistant, , in which case, is a cherry. If is a reticulation, then, as is normal and

it follows that the unique child of is , and so is a reticulated cherry. Since has no shortcuts and is equidistant, it is easily checked that is a reticulated cherry in which is the reticulation leaf if and only if for some . The lemma immediately follows. ∎
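The test described by Lemma 4.1 can be sketched as a short routine on the minimum distance matrix. The sketch assumes that the input is realised by an equidistant-weighted normal network, and it reads the criterion in part (ii) as follows: the reticulation leaf is the member of the closest pair that is strictly closer to some third leaf. That reading is our reconstruction from the proof, and the function name and both example matrices are hypothetical.

```python
from itertools import combinations


def classify_closest_pair(d):
    """Classify a closest pair of leaves in a minimum distance matrix d.

    d is a dict of dicts with d[x][y] the minimum distance between x and y.
    Returns ("cherry", a, b), or ("reticulated cherry", a, b) with b the
    reticulation leaf.
    """
    leaves = sorted(d)
    a, b = min(combinations(leaves, 2), key=lambda pair: d[pair[0]][pair[1]])
    others = [x for x in leaves if x not in (a, b)]
    if all(d[a][x] == d[b][x] for x in others):
        return ("cherry", a, b)
    if any(d[b][x] < d[a][x] for x in others):
        return ("reticulated cherry", a, b)
    return ("reticulated cherry", b, a)


# Hypothetical matrix of a network where b hangs below a reticulation
# whose parents are the parent of a and the parent of c.
D_NETWORK = {"a": {"a": 0, "b": 8, "c": 12},
             "b": {"a": 8, "b": 0, "c": 4},
             "c": {"a": 12, "b": 4, "c": 0}}
# Hypothetical matrix of a tree with the cherry {a, b}.
D_TREE = {"a": {"a": 0, "b": 2, "c": 6},
          "b": {"a": 2, "b": 0, "c": 6},
          "c": {"a": 6, "b": 6, "c": 0}}
```

For `D_NETWORK` the closest pair is `b`, `c`, and `b` is strictly closer to `a` than `c` is, so `b` is reported as the reticulation leaf. Ties for the minimum are broken by the order of `combinations`; the lemma applies to any minimising pair.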

Lemma 4.2.

Let be an equidistant-weighted normal network on , where . Let be the minimum distance matrix of , and let be a -element subset of such that

Then the distance matrix obtained from by reducing , if is a cherry, and by cutting , if is a reticulated cherry in which is the reticulation leaf, is the minimum distance matrix realised by the weighted network obtained from by reducing , if is a cherry, and cutting , if is a reticulated cherry in which is the reticulation leaf.

Proof.

By Lemma 4.1, is either a cherry or a reticulated cherry. Furthermore, by the same lemma, if is a reticulated cherry, we may assume, without loss of generality, that is the reticulation leaf. Also, by Lemma 3.2, is an equidistant-weighted normal network regardless of which of the two stated ways it is obtained from . If is a cherry, then it is clear that the distance matrix obtained from by reducing is the minimum distance matrix of . Therefore, suppose that is a reticulated cherry, in which case is obtained from by cutting . Let be the distance matrix obtained from by cutting . We will show that is the minimum distance matrix of .

Let and denote the parents of and , respectively, in . Since the only up-down paths in joining elements in that traverse the edge involve , it follows that for all . Thus, to complete the proof, it suffices to show that for all .

Let denote the parent of that is not . Since has no shortcuts, is not an ancestor of . Therefore, as is normal, there is a (directed) path from to a leaf, say, containing no reticulations, where . Note that, in what follows, we never determine but its existence underlies the rest of the proof.

Let

Thus, if , then every minimum length up-down path in joining and must traverse the edge . Observe that as the edge directed into , which is a tree edge, has positive weight and every up-down path from to traverses this edge and therefore , so . Now let

and let denote the subset of consisting of those elements such that , that is,

Again, observe that , and so the elements in are descendants of .

Let . We next determine whether or not is a descendant of . First note that if and only if is the parent of a leaf, in which case is the only leaf apart from that is a descendant of . So assume . We establish two claims:

  1. If

    then is not a descendant of .

  2. If

    then is a descendant of .

To see (i) and (ii), if is a descendant of , then for all , so

On the other hand, if is not a descendant of , then

It follows from (i) and (ii) that, for all , if is not a descendant of , then

otherwise

Hence for all , thereby completing the proof of the lemma. ∎

We next prove the uniqueness part of Theorem 2.3.

Proof of the uniqueness part of Theorem 2.3.

The proof is by induction on the sum of the number of leaves and the number of reticulations in . If , then and , and consists of the single vertex in , and so uniqueness holds. If , then, as is normal, and , and consists of two leaves attached to the root. Again, uniqueness holds as is equidistant and so the weights of the edges incident with the leaves is fixed. Now suppose that , so , and that the uniqueness holds for all equidistant-weighted normal networks for which the sum of the number of leaves and the number of reticulations is at most .

Let be a -element subset of such that

By Lemma 4.1, is either a cherry or a reticulated cherry of . If is a reticulated cherry, then, by the same lemma, we can determine from which of and is the reticulation leaf. Thus, without loss of generality, we may assume that is the reticulation leaf. Depending on whether is a cherry or a reticulated cherry, let and be the weighted network and distance matrix obtained from and by reducing or cutting , respectively. By Lemma 4.2, is the minimum distance matrix of . Since either has leaves and reticulations if is a cherry, or leaves and reticulations if is a reticulated cherry, it follows by the induction assumption that, up to equivalence, is the unique equidistant-weighted normal network with minimum distance matrix .

Let be an equidistant-weighted normal network on with minimum distance matrix . By Lemma 4.1, is either a cherry or a reticulated cherry in . Indeed, by the same lemma, is a cherry in precisely if it is a cherry in . First assume that is a cherry in . Then is a cherry in . Let be the equidistant-weighted normal network obtained from by reducing . By Lemma 4.2, is the minimum distance matrix of and so, by the induction assumption, and are equivalent. Using this equivalence and considering , it is easily seen that and are equivalent.

Now assume that is a reticulated cherry in . Then is a reticulated cherry in where, by Lemma 4.1, is a reticulation leaf. Let be the equidistant-weighted normal network obtained from by cutting . Since is the minimum distance matrix of , it follows by Lemma 4.2 that is the minimum distance matrix of . Therefore, by the induction assumption, and are equivalent. By again considering , it is now easily deduced that and are equivalent. This completes the proof of the uniqueness part of the theorem. ∎

4.1. The Algorithm

Let be an equidistant-weighted normal network on and let denote the minimum distance matrix of . We next give an algorithm which takes as input and , and returns a weighted network equivalent to . Its correctness is essentially established in proving the uniqueness part of Theorem 2.3 and so a formal proof of this is omitted. However, its running time is given at the end of this section. Called Equidistant Normal, the algorithm works as follows.

  1. If , then return the weighted phylogenetic network consisting of the single vertex in .

  2. If , say , then return the weighted phylogenetic network with exactly two leaves and adjoined to the root by edges each with weight .

  3. Else, find a -element subset of such that

    1. If for all (so is a cherry), then

      1. Reduce in to give the distance matrix on .

      2. Apply Equidistant Normal to input and . Construct from the returned weighted phylogenetic network on as follows. If is the parent of in , then subdivide with a new vertex , adjoin a new leaf to via the new edge , and set

        and . Keeping all other edge weights the same as their counterparts in , return .

    2. Else ( is a reticulated cherry, in which case, is the reticulation leaf if there exists an such that ),

      1. Cut in to give the distance matrix on .

      2. Apply Equidistant Normal to input and . Construct from the returned weighted phylogenetic network on as follows. If and denote the parents of and in , respectively, then subdivide and with new vertices and , respectively, adjoin and via the new edge , set and , and, for some positive real value such that and , set , , and