Identifiability of species network topologies from genomic sequences using the logDet distance

08/03/2021 ∙ by Elizabeth S. Allman, et al. ∙ University of Alaska Dalhousie University 0

Inference of network-like evolutionary relationships between species from genomic data must address the interwoven signals from both gene flow and incomplete lineage sorting. The heavy computational demands of standard approaches to this problem severely limit the size of datasets that may be analyzed, in both the number of species and the number of genetic loci. Here we provide a theoretical pointer to more efficient methods, by showing that logDet distances computed from genomic-scale sequences retain sufficient information to recover network relationships in the level-1 ultrametric case. This result is obtained under the Network Multispecies Coalescent model combined with a mixture of General Time-Reversible sequence evolution models across individual gene trees, but does not depend on partitioning sequences by genes. Thus under standard stochastic models statistically justifiable inference of network relationships from sequences can be accomplished without consideration of individual genes or gene trees.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

As genomic-scale sequencing has become increasingly common, attention in phylogenetics has shifted from inferring trees of evolutionary relationships for individual genetic loci from a set of species to inferring relationships between the species themselves. A substantial complication is that population genetic processes within species, as modeled by the Multispecies Coalescent (MSC) model can lead to individual gene trees having quite different topological structures than the tree relating the species overall. If the evolutionary history of the species also involved hybridization or other forms of horizontal gene flow, so that a species network is a more suitable depiction of relationships, the relationships of gene trees to the network, as modeled by the Network Multispecies Coalescent (NMSC) model, is even more complex.

Inference of species networks, through a combined NMSC and sequence substitution model, can be performed in a Bayesian framework Zhang et al. (2017); Wen and Nakhleh (2018) but computational demands severely limit both the number of taxa and the number of genetic loci considered. Other methods take a faster two-stage approach, first inferring gene trees which are treated as “data” for a second inference of a species network. Approaches include maximum pseudolikelihood using either rooted triples (PhyloNet) or quartets (SNaQ) displayed on the gene trees Yu and Nakhleh (2015); Solís-Lemus and Ané (2016), or the faster, distance-based analysis built on gene quartets of NANUQ Allman et al. (2019a). Still, the first stage of these approaches, the inference of individual gene trees, can be a major computational burden. Avoiding such gene tree inference, and passing more directly from sequences to an inferred network, could substantially reduce total computational time in data analysis pipelines.

The goal of this paper is to show that most topological features of a level-1 species network can be identified from logDet intertaxon distances computed from aligned genomic-scale sequences. In particular this can be done without partitioning the sequences by genes, under a combined model of the NMSC and a mixture of general time-reversible (GTR) substitution processes on gene trees. While the main result, that the logDet distances retain enough information to recover most of the species network, despite having lost information on individual genes, is a theoretical one, it points the way toward faster algorithms for practical inference. In particular, since the computation of logDet distances requires little effort, it suggests that a distance-based approach similar to NANUQ’s, but avoiding individual gene tree inference, may offer substantially faster analyses than current methods.

The model of sequence evolution underlying our result accounts not only for base substitutions along each gene tree, but also for variation in gene trees due to their formation under a coalescent process combined with hybridization or similar gene transfer. Our model extends to networks the mixture of coalescent mixtures model on species trees of Allman et al. (2019b), which itself extended the coalescent mixture introduced by Chifman and Kubatko Chifman and Kubatko (2015). More specifically, for a fixed species network, gene trees are formed under the Network Multispecies Coalescent model Meng and Kubatko (2009); Yu et al. (2011); Zhu et al. (2016) for each site independently. GTR substitution parameters for base evolution on each site’s tree are then independently chosen from some distribution, leading to a site pattern distribution. These site distributions are finally combined to give a site pattern distribution for genomic sequences. (As discussed in Section 2, this distribution also applies to a more realistic model in which multisite genes with a single substitution process have lengths chosen independently from some distribution.) While this pattern frequency distribution thus reflects the substitution processes on all the gene trees, information about pattern frequencies arising on any individual gene tree is hidden.

The logDet distance was first introduced in the context of a single class general Markov model of sequence evolution on a single gene tree

Steel (1994); Lockhart et al. (1994), and has been used both to obtain both gene tree identifiability results and for inference of individual gene trees. Considering genomic sequences, Liu and Edwards Liu and Edwards (2009), and independently Dasarathy et al. Dasarathy et al. (2015), showed that for a Jukes-Cantor substitution model and an ultrametric species tree, the Jukes-Cantor distances obtained under the coalescent mixture model still allowed for consistent inference of topological species trees. By passing to the logDet distance, Allman et al. Allman et al. (2019b) extended this result to the more realistic mixture of coalescent mixtures model, showing that the logDet distance allowed for consistent inference of a topological species tree, assuming it is ultrametric in generations. This study builds on all these works on gene and species tree models, but considers level-1 species networks on which all extant species are equidistant from the root.

Passing from species trees to networks is a substantial step, however, and our approach is strongly motivated by the approach taken by Baños Baños (2019)

in studying identifiability of features of unrooted level-1 topological species networks from gene tree quartet concordance factors (probabilities of the different quartet topologies displayed on gene trees). In the ultrametric setting of this work, we show that logDet distances computed from genomic sequences suffice to determine 4-cycles on undirected rooted triple networks, and then that this 4-cycle information for different rooted triples can be combined to determine all cycles of size 4 or more, and even all hybrid nodes in those cycles of size 5 or more. We do not obtain information on 2- or 3- cycles, so our results closely parallel those in

Baños (2019), despite the rather different source of information.

There are a number of other theoretical works in the literature on determining phylogenetic networks from limited information. For instance, Jansson and Sung (2006) investigates determining a level-1 network from the rooted triple trees it displays, Huber et al. (2017, 2018) discuss how knowledge of trinets (induced 3-taxon directed rooted networks) and quarnets (induced 4-taxon undirected unrooted networks) determine larger networks, and van Iersel et al. (2020) explores determination of networks from distances. However, the question of how, or whether, these results can be applied to biological data is not addressed, and the setting of these works is not directly applicable to obtaining our results.

Other works Gross and Long (2018); Gross et al. (2020); Hollering and Sullivant (2021) use algebraic approaches to show that certain types of level-1 networks can be identified from joint pattern frequency arrays under group-based models of sequence evolution such as the Jukes-Cantor and Kimura models. In addition to their restriction on sequence evolution models, these works do not incorporate a coalescent process. That is, all sequence sites are assumed to have evolved on one of the finitely-many trees displayed on the network. Since the absence of a coalescent process is a limiting case of our coalescent-based model, our results allowing for mixtures of more general sequence evolution models extend those results in the ultrametric case. Algebraic study of a network model combined with the general Markov model, again with no coalescent process, was also conducted in Casanellas and Fernández-Sánchez (2020).

This paper proceeds as follows. Section 2 defines the networks and models under consideration, as well as the logDet distance. Section 3 uses combinatorial arguments to show how information on undirected rooted triple networks can be used to determine features of a larger directed network from which they are induced. Expected frequencies of site patterns for sequences produced by the mixture of coalescent mixtures model are studied in Section 4, and shown to be expressible as convex combinations of pattern frequencies from simpler networks. In Section 5 we show that the ordering by magnitude of logDet distances for triples of taxa tells us about the induced rooted triple species network, and by combining this with the result of Section 3 we obtain our main identifiability result, Theorem 5.1. Section 6 further studies the logDet distances from a rooted triple network, in order to better understand what triples of distances can arise under the mixture of coalescent mixtures model. We conclude in Section 7 with an outline of how these results can be developed into a practical inference algorithm.

2 Networks and models

2.1 Phylogenetic Networks

Although there are many variations on the notion of a phylogenetic network in the literature, we adopt ones appropriate to the Network Multispecies Coalescent (NMSC) model. This model, which describes the formation of trees of gene lineages in the presence of both incomplete lineage sorting and hybridization, will be further developed in the next subsection. First, we focus on setting forth combinatorial aspects of the networks.

Definition 1

Solís-Lemus and Ané (2016); Baños (2019) A topological binary rooted phylogenetic network on taxon set is a connected directed acyclic graph with vertices and edges , where is a disjoint union and is a disjoint union , with a bijective leaf-labeling function with the following characteristics:

  • The root has indegree 0 and outdegree 2.

  • A leaf has indegree 1 and outdegree 0.

  • A tree node has indegree 1 and outdegree 2.

  • A hybrid node has indegree 2 and outdegree 1.

  • A hybrid edge is an edge whose child node is hybrid.

  • A tree edge is an edge whose child node is either a tree node or a leaf.

When or , we refer to as a rooted triple network or a rooted quartet network, respectively.

The vertices, and edges, of are partially ordered by the directedness of the graph. For instance, a node is below a node , and is above , if there exists a non-empty directed path in from to . The root is thus above all other nodes.

A metric notion of the network above incorporates some of the parameters of the NMSC model. This introduces edge lengths, measured in generations throughout this article, as well as probabilities that a gene lineage at a hybrid node follows one or the other hybrid edge as it traces back in time toward the network root. Since we focus on binary networks, only hybrid edges are allowed to have length 0, to model possibly instantaneous jumping of a lineage from one population to another.

Definition 2

A metric binary rooted phylogenetic network is a topological binary rooted phylogenetic network together with an assignment of weights or lengths to all edges and hybridization parameters to all hybrid edges, subject to the following restrictions:

  • The length of a tree edge is positive.

  • The length of a hybrid edge is non-negative.

  • The hybridization parameters and for a pair of hybrid edges with the same child hybrid node are positive and sum to 1.

A metric network of this sort is said to be ultrametric if every directed path from the root to a leaf has the same total length. This is equivalent to requiring the ultrametricity of all trees displayed on the network. An example of a simple ultrametric network is shown in Figure 1 (Right).

Figure 1: (Left) An ultrametric species network with time in generations before the present, hybrid edges and shown in red, and population functions on each edge depicted by widths of “tubes.” The edge lengths are measured on the -axis between the dashed lines indicating speciation and hybridization events. The dashed red/blue boundary represents a hybrid node, the top dashed line the root of the network, and other dashed lines tree nodes. (Right) A schematic of the same species tree, which does not show population sizes. Hybridization parameters and are omitted from both drawings.

On directed networks there are several analogs Steel (2016) of the most recent common ancestor of a set of taxa on a tree. The following is the most useful in this work.

Definition 3

Steel (2016) Let be a (metric or topological) binary rooted phylogenetic network on a set of taxa and let . Let be the set of nodes which lie on every directed path from the root of to any . Then the lowest stable ancestor of on , denoted , is the unique node such that is below all with . The lowest stable ancestor (LSA) of a network on is .

Phylogenetic networks as defined here have no cycles in the usual sense for a directed graph. The term cycle will thus be used to refer to a collection of edges that form a cycle when all edges are undirected. A cycle must contain at least two hybrid edges sharing a hybrid node, and may contain any non-negative number of tree edges. The class of networks we focus on is those in which cycles are separated, in the following sense.

Definition 4

A rooted binary phylogenetic network is said to be level-1 if no two distinct cycles in share an edge.

Although this is not the standard definition of level-1 Rosselló and Valiente (2009), in the setting of binary networks it is equivalent.

Each cycle on a level-1 phylogenetic network contains exactly one hybrid node and two hybrid edges with that node as a child. Thus there is a one-to-one correspondence between cycles and the hybrid nodes they contain. A cycle composed of edges, 2 of which are hybrid, is called an -cycle. If the cycle’s hybrid node has leaf descendants, it is an -cycle.

Passing from a large network to one on a subset of the taxa is similar to the process for trees.

Definition 5

Suppressing a node with both in- and out-degree 1 in a directed phylogenetic network means replacing it and its two incident edges with a single edge from its parent to its child. For a metric network, the new edge is assigned a length equal to the sum of lengths of the two replaced. If the outedge was hybrid, the new edge is also hybrid and retains the hybridization parameter.

Similarly, suppressing a node of degree 2 between two undirected edges means replacing it and its two incident edges with a single undirected edge.

Definition 6

Let be a (metric or topological) binary rooted phylogenetic network on and let . The induced rooted network on is the network obtained from by retaining nodes and edges in every path from the root on to any , and then suppressing all nodes with in- and out-degree 1. We then say displays .

We need the notion of a rooted undirected network, in which all edges have been undirected but the root retained. Note that if a rooted network is a tree, knowledge of the root alone is enough to recover the direction of every edge, so this notion is not useful in that setting. If cycles are present, knowledge of the root determines only the direction of every cut edge (an edge whose deletion results in a graph with two connected components), and edges directly descended from cut edges. Knowing the root and all hybrid nodes in an undirected level-1 network does, however, determine the full directed network.

Several other notions of networks induced from a directed one are needed.

Definition 7

Let be a (metric or topological) binary rooted phylogenetic network on .

  1. Baños (2019) The LSA network induced from is the network on obtained by deleting all edges and nodes above , and designating as the root node.

  2. The undirected LSA network is the rooted network obtained from the LSA network by undirecting all edges.

  3. Baños (2019) The unrooted semidirected network is the unrooted network obtained from the LSA network by undirecting all tree edges and suppressing the root, but retaining directions of hybrid edges.

For a binary level-1 network , the only possible structure above the LSA has the form of a (possibly empty) chain of 2-cycles Baños (2019), an example of which is shown in Figure 2. The LSA network is obtained by simply deleting that chain.

Figure 2: A rooted network whose LSA network is the rooted tree , but which has a chain of 2-cycles above .

Note that the terminology of “-cycles” can be applied to LSA networks , as hybrid edges retain their direction. On undirected LSA networks , however, “-cycle” can still be applied, but “-cycle” generally cannot.

Definition 8

By suppressing a cycle in a topological level-1 network we mean deleting all edges in , identifying all nodes in , and if the resulting node is of degree 2 suppressing it. If the network is rooted and this results in the root becoming a degree 1-node, then the resulting edge below the root is also deleted, with its child becoming the root.

Suppressing an -cycle in a binary level-1 network results in a non-binary network when . However if only 2- and 3-cycles are suppressed, the result is binary.

2.2 Coalescent Model on Networks

The formation of gene trees within a species network, as ancestral lineages of sampled loci from extant taxa join together moving backwards in time, is given a mechanistic description by the Network Multispecies Coalescent Model (NMSC) Meng and Kubatko (2009); Yu et al. (2011); Zhu et al. (2016).

Parameters of the NMSC for a set of taxa include a metric rooted binary phylogenetic network on , with edge lengths in generations. In addition, for each edge fix a function giving the (haploid) population size along the edge, where is the population size at the child node and is the population at time units above it. Finally, let be an additional population size function for an infinite length ‘edge’ ancestral to the root of the network. The need not be constant nor equal, although those are common assumptions in other works. As in Allman et al. (2019b) we make the biologically-plausible technical assumptions that the functions are bounded, and that all are integrable over finite intervals.

Figure 1 (Left) depicts an example species network that is ultrametric in generations, with hybrid edges and , and population functions on each edge depicted by time-varying widths of the network edges. The edge lengths are measured on the -axis between the horizontal lines indicating speciation and hybridization events. Figure 1 (Right) gives a schematic of the same species tree, without a depiction of population functions.

The standard Kingman coalescent models the formation of gene trees, with edge lengths in generations, within a single population edge , with pairs of lineages coalescing independently as they trace backward in time, at instantaneous rate . The multispecies coalescent model (MSC) extends this to a tree of populations, by using the standard coalescent on each edge, as well as an infinite length edge above the root, allowing multiple gene lineages to enter a population from its descendant ones at a tree node. The NMSC extends this further, so that lineages reaching hybrid nodes randomly enter one or the other hybrid edge above them, with the choice determined independently according to the hybridization parameter probabilities. Thus the NMSC parameters and determine a distribution of rooted metric gene trees. The structure of the NMSC also ensures that the distributions of gene trees obtained by marginalization to a subset of taxa are the same as the distributions obtained from the NMSC on the displayed network .

2.3 Sequence substitution models on gene trees

The -state general time-reversible model (GTR) for sequence evolution is a continuous-time Markov process on a metric gene tree. Gene tree edge lengths are in substitution units, and sequences are composed of possible states, or bases. Model parameters are a instantaneous rate matrix together with a -state distribution , with non-negative entries summing to 1, satisfying the following:

  1. off-diagonals entries of are positive,

  2. row sums of are 0,

  3. ,

  4. is symmetric.

In the ultrametric framework for our species networks, we introduce an additional time-dependent but lineage-independent rate scalar for , where is measured in generations from leaves to the root and beyond, and has units of substitutions/generation. We assume is piecewise-continuous, for all so that the mutations process never stops, and so that the total amount of possible mutation is unbounded. Following Allman et al. (2019b), this substitution model is denoted by GTR+.

For any node on a gene tree, let denote the distance, in generations, to that node from its descendant leaves. The states at a single site in sequences at the taxa at the leaves on the gene tree are then determined as follows: A state is randomly chosen at the root of the tree from the distribution . For each edge descendant from a node the site undergoes random state changes with rates for times

to obtain states at the child nodes. The full substitution process on the edge is thus described by the Markov matrix

A similar process is then repeated for those nodes’ children, and so on, until states at the taxa have been determined.

2.4 Mixture of coalescent mixtures

The model we focus on is the -class mixture of coalescent mixtures Allman et al. (2019b) on an ultrametric network. This model has as parameters an ultrametric species network , population size functions , a finite collection of GTR+ parameters for the

classes, and a vector

of positive class size parameters summing to 1.

Sequence data is generated as follows: For each site:

  1. a gene tree is sampled according to the NMSC model on with population sizes ,

  2. class is sampled from the distribution to determine parameters ,

  3. the bases for each are sampled under the GTR+ process on with parameters .

This model is denoted by where

Sampling independent sites from this model produces -state aligned sequences of length . As usual in phylogenetics, these are summarized through counts of site patterns across the sequences in an -dimensional array. Marginalizations of this array to 2-dimensions give pairwise site pattern count matrices that compare only the sequences for two taxa in .

In the tree context, two extensions of this model were discussed in Allman et al. (2019b). For the first, the model assumption of one independently drawn gene tree for each site is modified to a more realistic one for genomic sequences in which all sites for a genetic locus share a gene tree. If the lengths (in number of sites) of the loci are independent identically distributed draws from some distribution, then the expected site pattern distribution for such a model is unchanged from that determined by . Only the rate of convergence, as the number of sampled genes grows, of frequencies of sampled site patterns to the asymptotic distribution will be slowed. Although the identifiability results of this paper are only formally stated for the model on networks, they apply more generally to a similarly extended network model.

Another extension in the tree setting in Allman et al. (2019b) allowed for relaxing the ultrametric condition while retaining strong results on identifiability from the logDet distances. In that extension, the scalar rate function was allowed to be edge dependent as long as a certain symmetry condition on mixture components resulted in ultrametricity in substitution units “on average” across gene trees. While a similar model extension in the network setting seems likely to lead to similar results, it is not explored here, as the technical complications are greater than in the tree case.

2.5 LogDet distance

The fundamental tool we use to study relationships of taxa under the mixture of coalescent mixtures model is the logDet distance between a pair of aligned sequences. It is computed as follows: For taxa , let be a matrix of empirical relative site-pattern frequencies, obtained by normalizing the site pattern count matrix for and , so that its entries sum to 1. Thus the entry of is the proportion of sites in the sequences exhibiting base for and base for . With and the vectors of row and column sums of , which give the proportions of various bases in the sequences for and , let and the products of the entries of , , respectively. Then the empirical logDet distance is

(1)

Under most phylogenetic models, including the mixture of coalescent mixtures model, individual site patterns in sequences are assumed to be independent and identically distributed. By the weak law of large numbers,

computed from a sample will converge in probability to its expected value as the sequence length goes to . By the continuous function theorem (e.g., van der Vaart (1998)), the empirical logDet distance thus converges in probability to the logDet distance computed by the same formula from the expected , a quantity we refer to as the theoretical logDet distance and denote by .

3 Rooted Networks from Undirected Rooted Triple Networks

The goal of this section is to establish Proposition 1, a combinatorial result indicating features of a topological level-1 rooted -taxon network that can be recovered from its induced undirected rooted triple networks with 2- and 3-cycles suppressed. This is a rooted analog of a key result of Baños (2019) relating unrooted semidirected networks and their induced undirected quartet networks. Later sections of this paper focus on identifying these rooted triple networks under the model .

There are several possible routes to Proposition 1. One approach would be to follow the argument of the quartet analog, with modifications throughout due to the rooted setting. Another would be to imitate the alternate proof of the quartet result given in Allman et al. (2019a), based on an extension of the intertaxon quartet distance of Rhodes (2019), but instead using the rooted triple distance also introduced in that work. The argument presented here is shorter than these approaches, as it leverages information about undirected rooted triple networks to obtain information about undirected quartet networks, and then applies the theory of Baños (2019).

The following result, extracted from the proof of Theorem 4 of Baños (2019), will be used. In it, and throughout this work, by a network modulo 2- and 3-cycles we mean the network obtained by suppressing all 2- and 3-cycles. Similarly, modulo directions of edges in 4-cycles means that all edges in 4-cycles are undirected. As a result, which of the edges in a 4-cycle are hybrid, and therefore which node is hybrid, is not indicated.

Lemma 1 (Baños (2019))

Let be a level-1 rooted binary topological phylogenetic network on . Let be the set of undirected quartet networks obtained from those displayed on by unrooting, suppressing all cycles of size 2 and 3, and undirecting all edges. Then modulo 2- and 3-cycles and directions of edges in 4-cycles, the semidirected unrooted network is determined by .

In order to apply this to rooted triples, we first recall some combinatorial properties of rooted triple and quartet networks.

Lemma 2 (Baños (2019))

Let be a level-1 unrooted semidirected binary quartet network. Then has no -cycles for , and at most one -cycle. If has a 4-cycle, then it has neither - nor -cycles. If there is no -cycle, then there are at most two -cycles, with at most one of these a -cycle.

Lemma 2 can be used to characterize possible cycles in a rooted triple network, by attaching an outgroup at the root. More specifically, by attaching an outgroup to the root of an -taxon network on taxa with we mean identifying the root of the network with the node on an edge and undirecting all tree edges. This gives a -taxon unrooted semidirected network. The rooted triple networks displayed on the original network are then in one-to-one correspondence with induced semidirected quartet networks containing on the new network. This construction yields the following.

Corollary 1

Let be a level-1 binary rooted triple network. Then has no -cycles for , and at most one -cycle in which case there are no - or -cycles. If there is no -cycle, then there are at most two -cycles, with at most one of these a -cycle.

Considering a rooted quartet network , and the impact of passing to its associated unrooted semidirected quartet network , Lemma 2 also immediately yields the following.

Corollary 2

Let be a level-1 rooted binary quartet network. Then has no -cycles for , and has at most a one 5-cycle or 4-cycle, but not both.

We now catalog the rooted quartet networks with 4- or 5-cycles, modulo smaller cycles.

Lemma 3

Let be a level-1 binary rooted quartet network with one -cycle or one -cycle. Then modulo 2- and 3- cycles and up to taxon relabelling, the LSA network is one of those shown in Figure 3. Thus displays either 1, 2, or 3 rooted triples with a 4-cycle.

Proof

Let be a rooted level-1 network on with a cycle of size or . By Corollary 2, is the only cycle of size greater than 3. Figure 3 shows the topologies, up to taxon relabeling, of all the rooted quartet networks with a - or -cycle and no - or -cycles, as determined by enumerating all possible locations for adding hybrid edges to a rooted 4-taxon tree. The top row of Figure 3 shows the quartet networks with exactly one displayed rooted triple, on , having a 4-cycle. The middle row shows the networks with exactly two displayed rooted triples, on and , having a 4-cycle. The bottom row shows those with exactly three displayed rooted triples, on , , and , having a 4-cycle. ∎

Figure 3: All rooted directed topological quartet networks with a single - or -cycle, and no other cycles, up to relabeling of taxa. Networks in the top row display exactly one rooted triple with a 4-cycle, those in the middle row display two, and those in the bottom row display three.

Now we proceed to the main result of this section.

Proposition 1

Let be a level-1 rooted binary topological phylogenetic network on . Let be the set of undirected rooted triple networks obtained from those displayed on by suppressing all cycles of size 2 and 3 and undirecting all edges. Then modulo 2- and 3-cycles and directions of edges in 4-cycles, the LSA network is determined by .

Proof

We first build a set of rooted quartet networks from . Let and let be the set of undirected rooted triple networks on any three elements of , so . By Corollary 2 and Lemma 3, there are , , , or elements of with a 4-cycle. We consider each possibility in turn, showing that we can determine the undirected rooted quartet network modulo 2- and 3-cycles.

If , all rooted triples in are trees and since has no 4- or 5-cycles by Lemma 3, the undirected LSA network modulo 2-and 3-cycles is a tree. By a well-known result for trees Semple and Steel (2005), determines modulo 2- and 3-cycles.

If , then modulo 2- and 3-cycles and relabelling of taxa, is isomorphic to one of the networks in the top row of Figure 3. But for all these networks if are the taxa in the rooted triple network with a 4-cycle, then the rooted 4-taxon network is obtained by attaching as an outgroup to it. Thus is determined modulo 2- and 3-cycles.

If , is isomorphic, modulo 2- and 3-cycles and relabeling, to one of the networks in the middle row of Figure 3. Note that for all those rooted quartet networks, the displayed rooted triple networks with 4-cycles are on and , and the 4-taxon network can be obtained from either of these by replacing or with a cherry on , thus determining modulo 2- and 3-cycles.

If , is isomorphic, modulo -, and -cycles and relabeling, to one of the networks in the bottom row of Figure 3. In both of these, there is exactly one taxon, , that is in all three rooted triple networks with 4-cycles, and there is exactly one taxon, , that has graph-theoretic distance 3 from in exactly one of the two rooted triples with 4-cycles it appears in. Thus we can determine which taxon is , and which is . For the remaining pair , if there is a taxon that is at distance 4 from in both 4-cycle rooted triple networks it appears in, then the 4-taxon network is the one shown on the left, and that taxon is . Otherwise, the network is the one shown on the right. In this case there is exactly one rooted triple network on and which has its third taxon at distance 2 from the root, and this determines . Thus we obtain the rooted 4-taxon network modulo 2- and 3-cycles, and hence modulo 2- and 3-cycles

With all rooted 4-taxon networks modulo 2- and 3-cycles determined, we attach an outgroup to all, giving the collection of all 5-taxon unrooted networks including , modulo 2- and 3-cycles, induced from the unrooted network formed by attaching to the root of . But the unrooted 4-taxon networks displayed on these 5-taxon ones form the collection of all 4-taxon undirected networks (possibly including ) modulo 2- and 3-cycles displayed on .

Lemma 1 now determines modulo 2- and 3-cycles, with directions of cut edges and edges in cycles of size , though not in 4-cycles. Rooting by the outgroup we recover the topology of modulo 2- and 3-cycles and directions of edges in 4-cycles. ∎

4 Expected pattern frequencies as convex sums

The theoretical logDet distance between taxa depends on the matrix of expected relative site-pattern frequencies in aligned sequences for taxa , under the mixture of coalescent mixtures model . The goal of this section is to show that on a level-1 ultrametric rooted triple network can be expressed as a convex combination of frequency matrices for networks with no cycles below the LSA of the taxa. In this way, we reduce the computation of to its computation on simpler networks. This is complicated somewhat by the fact that the convex combination may have terms which are expected pattern frequencies conditioned on a pair of lineages coalescing below a certain node in a network.

The lemmas that follow often involve modifying a network by removing a hybrid edge, to obtain a new network . If one hybrid edge in a cycle is removed, the hybrid node is then suppressed as the other hybrid edge is joined to the descendant tree edge and given the induced length and population size. We retain all other edge lengths and population sizes, as well as hybrid parameters for unaffected cycles. The parameters for the substitution process describing sequence evolution on gene trees are also retained. If denotes the full set of parameters associated to , then denotes the full set of parameters associated to in this way. Notation such as or denotes the dependence of on the parameters or , which include the network or .

Figure 4: Examples of level-1 rooted triple networks with -, -, and -cycles. While multiple -cycles may be present along any pendant edge shown here in dashes, there can be at most two -cycles, whose hybrid nodes are located on a dashed pendant edge. At most one -cycle can be present. Site-pattern frequency matrices from the model on rooted triple networks with these types of cycles are convex combinations of such matrices for 1, 2, or 4 networks without those cycles, as shown by Lemmas 4 and 5.

The most straightforward network simplifications occur when the hybrid node of a cycle has a single descendant leaf, as depicted by the example -, - and -cycles in Figure 4.

Lemma 4

(Removing -cycles) Let be a binary level-1 ultrametric rooted triple network on and let be a -cycle in with hybrid edges . Let be the network obtained from by removing . Then, under the model for any ,

Proof

Since the hybrid node of has only one descendant, the combined coalescent and substitution process on can be expressed as a linear combination of those processes on , weighted by . That is, for any ,

But and only differ by and which have the same length, though possibly different population sizes. However, since only one lineage can be present in the population for those edges, those population sizes have no impact in model , so . Since , the claim follows. ∎

If a network has multiple -cycles, then applying Lemma 4 repeatedly gives where is a rooted network with no -cycles obtained from by deleting one hybrid edge in each of the -cycles on .

Lemma 5

(Decomposing - and -cycles) Let be a binary level-1 ultrametric rooted triple network on and let be either a - or a -cycle on . Let be the hybrid edges of with . Let be the network obtained from by removing , . Then, under the model for any ,

Proof

Since the hybrid node of has only one descendant, we can express the combined coalescent and substitution process on as a linear combination of the processes of the , with coefficients , . ∎

A level-1 rooted triple network may have one -cycle, one -cycle, or two -cycles. In the last case, Lemma 5 may be applied twice, to express the pattern frequency matrix under the model as a convex combination of four such matrices for networks with no -cycles.

With Lemma 4 this shows that computation of the matrix of relative site-pattern frequencies of a level-1 ultrametric rooted triple network reduces to cases where there are no -, -, or -cycles. The effects of - and -cycles are more complicated, however, as a coalescent event may or may not occur below the hybrid nodes of such cycles.

The following definition facilitates studying the impact of such cycles. In it a node may be either an existing node or a new node introduced along an edge of a network, with appropriate division of the original edge length and population function. Although strictly speaking this second case passes out of the class of binary networks, we allow this only to simplify reference to intermediate states of the coalescent process.

Definition 9

Let

be the random variable giving the number of lineages at node

under the NMSC. With denoting the set of taxa below , has sample space .

When is clear from context we write . We also use the notation

to denote the joint distribution of site patterns conditioned on

under the model with parameters .

Figure 5: (Top) A rooted level-1 ultrametric network on , with the -cycle closest to LSA() shown. (Bottom) The networks , , and obtained from , respectively, as described in Lemma 6. Note that there may be additional cycles along the dashed lines, with hybrid nodes above node and taxon .
Lemma 6

(Decomposing -cycles) Let be a binary level-1 ultrametric rooted triple network on without - or -cycles. Suppose, as depicted in Figure 5, is a -cycle on , with edges from node to hybrid node , hybridization parameters , leaf descendants of , and no cycles below . Denote by , the network obtained from by removing , and by the network obtained from by deleting all edges and nodes below and attaching edges and of appropriate length so that is ultrametric. Then, under the model for any ,

Proof

Since the structure of the model for , , and is identical below , we may also use to denote and . Thus

(2)

But since for by the argument used for Lemma 4, and the identity ,

Substituting this into equation (2) and using yields the claim. ∎

Note that while and of Lemma 6 have the same topology and edge lengths, the hybrid edges may have different population sizes. Thus is possible. This is in contrast to the argument on removing -cycles in Lemma 4, in which hybrid edge population sizes did not play a role.

Since a level-1 3-taxon rooted network cannot have a -cycle above a -cycle, Lemma 6 can be applied recursively to the , to eliminate all -cycles. Thus the remaining complication to producing an expression for as a convex combination of such matrices for networks without -, -, or -cycles is the presence of terms of the form where has cherry and neither - nor -cycles. Such terms are handled with the following.

Lemma 7

(Decomposing -cycles conditioned on coalescence) Let be a binary level-1 ultrametric rooted triple network on without - or -cycles, and on which form a cherry. Let be the parent of the common parent of . Denote by a network obtained from by removing one hybrid edge from each -cycle.

If has no -cycle, then

If has a -cycle, with hybrid edges and hybridization parameters , then let be the network obtained from by removing , . Then

Proof

Conditioned on , there is only one lineage in any population above and below the hybrid node of a -cycle, if such a cycle is present, or the LSA otherwise. Thus, as in the proof of Lemma 4, no -cycle will have any effect on the joint distribution. If there is no -cycle on this yields the claim. If there is a -cycle, since only one lineage reaches the hybrid node of the -cycle, we obtain the claim as in the proof of Lemma 5. ∎

Lemma 8

(Decomposing -cycles) Let be a binary level-1 ultrametric rooted triple network on with no cycles below its LSA except a -cycle . Let denote the hybrid node of , and the hybrid edges with hybridization parameters and lengths , as depicted at the top of Figure 6. Let , and be the networks derived from shown at the bottom of Figure 6. Then, under the model , for any , with ,

Proof

Observe that

(3)

Since and ,

Using this and in equation (3) yields the claim. ∎

Figure 6: (Top) A rooted level-1 ultrametric network with a -cycle, and (Bottom) the networks , , , and used in Lemma 8. Although only topology and branch lengths are shown, population size parameters for each edge of are obtained from the corresponding ones of .

5 Theoretical logDet distances

In this section, we show that, under the mixture of coalescent mixtures model on an ultrametric level-1 rooted triple network, the theoretical logDet distances between taxa determine most topological features of the network. The previous section established that the pattern frequency matrices for the model on such networks can be expressed as convex combinations of those on simpler networks (possibly subject to conditioning), whose only cycles are -cycles located above , such as depicted in Figure 2. The following algebraic lemma is key to drawing conclusions about the determinants of such linear combinations of matrices.

Lemma 9 (Allman et al. (2019b), Lemma 3.1)

Suppose for each , and are symmetric positive definite matrices such that for every with the inequality strict for some and some . For , let

Then

Analyzing the pattern frequency matrix for networks with -cycles above requires a detailed look at the coalescent process in such a chain of -cycles. For a simple case, assume lineages and enter the single cycle chain depicted in Figure 7. Population functions , , and are fixed for each edge, where for convenience, we shift domains from the convention in Section 2.2 so that is defined on , on , and on .

The probability density for time to coalescence of the lineages entering at the bottom node () can be calculated piecewise as follows: For ,

as given in Allman et al. (2019b).

For ,

where is the probability of no coalescence before , and for

Finally, for , with the probability of no coalescence before ,

Figure 7: A 2-cycle and adjacent tree edges in a species network, depicted (Left) with pipes whose width represent population sizes, and (Right) as a schematic.

It is straightforward to extend this analysis of to a chain with an arbitrary number of 2-cycles. Since we will not need an explicit formula for the distribution of coalescent times for two lineages entering such a chain of 2-cycles, we omit a complete derivation, and only state the properties of it that we use.

Formally, a chain of 2-cycles is a species network with leaf , internal vertices , , , , with root , tree edges , and hybrid edges , , together with edge lengths, piecewise-continuous population size functions on each edge, including above the root, and hybrid parameters for each pair of hybrid edges .

Using the technical assumptions given in Subsection 2.2, it is straightforward to deduce the following.

Lemma 10

Consider a fixed chain of 2-cycles with leaf . Let

denote the probability density function under the NMSC for the time

of coalescence of two lineages entering the chain at . Then is piecewise continuous, and for all .

The next three technical lemmas generalize Lemmas 4.1, 4.4, and 4.5 of Allman et al. (2019b) from a tree to a network setting. These culminate in Proposition 2 below, which justifies the application of Lemma 9.

Lemma 11

Let be the probability density function under the NMSC for the time of coalescence of two lineages entering a chain of 2-cycles, and for times let be the conditional density given

. Then the cumulative distribution functions for

and satisfy

with the inequality strict on some interval.

Proof

Since for all , the inequality is immediate for . Since using Lemma 10 we have for , the inequality is strict on a subinterval.

For , let and , so

Differentiating and using Lemma 10 shows is decreasing for . Since as , this implies , as claimed. ∎

Lemma 12

Let be probability density functions on , with cumulative distribution functions , such that for all , with the inequality strict on some interval. Let for a positive, piecewise-continuous on such that . For let

Then if ,

while for